Remove posts

This commit is contained in:
2015-12-19 20:57:54 +00:00
parent 0a9f09b29c
commit e0f4d2dcc5
30 changed files with 0 additions and 4328 deletions


@@ -1,141 +0,0 @@
---
title: Secure Git Repositories
layout: post
tags: [ 'git', 'security' ]
description: Seamless data encryption of repository files.
---
I use private repositories on [GitHub][1], but I still don't feel quite comfortable pushing sensitive data like
passwords, keys, and account information. Typically that information ends up just sitting on my local machine or in my
head, ready for me to pull up as needed. It would be much better if that information were a bit more fault tolerant and,
even better, if I could follow workflows similar to the rest of my application code.
After some research I discovered [gist 873637][2], which discusses using `git`'s clean and smudge [filters][4] to pass
files through `openssl`: the clean filter encrypts files as they are staged and the smudge filter decrypts them on
checkout, so `git`'s index only ever contains encrypted, base64-encoded file contents. Soon after, I found
[`shadowhand/git-encrypt`][3], which packages that approach.
## Initial Setup
First, I did a one-time install of `shadowhand/git-encrypt` on my machine:
{% highlight console %}
$ git clone git://github.com/shadowhand/git-encrypt.git /usr/local/git-encrypt
$ chmod +x /usr/local/git-encrypt/gitcrypt
$ ln -s /usr/local/git-encrypt/gitcrypt /usr/local/bin/gitcrypt
{% endhighlight %}
Next, I created a new repo and used `gitcrypt init` to set things up:
{% highlight console %}
$ mkdir fort-knox && cd !$
$ git init
Initialized empty Git repository in /private/tmp/fort-knox/.git/
$ gitcrypt init
Generate a random salt? [Y/n] Y
Generate a random password? [Y/n]Y
What encryption cipher do you want to use? [aes-256-ecb]
This configuration will be stored:
salt: 7d9f6cc1512aa2b5
pass: EAC8405A-DD64-43A3-A17F-EB28195B4B1E
cipher: aes-256-ecb
Does this look right? [Y/n] Y
Do you want to use .git/info/attributes? [Y/n] n
What files do you want encrypted? [*]
{% endhighlight %}
Now I just have to be sure to keep the salt and pass stored securely elsewhere for the next time I set up this repo.
Other than that, it's ready for me to use like any other `git` repository.
## A Practical Bit
Since I won't be setting up this repository frequently, it'd probably be best to keep a reminder of what I'll need to
do. So I updated `.gitattributes` to exclude itself and `README` from encryption:
{% highlight vim %}
* filter=encrypt diff=encrypt
README -filter -diff
.gitattributes -filter -diff
[merge]
renormalize=true
{% endhighlight %}
And included the necessary commands and a reference in `README`:
{% highlight console %}
Remember...
git clone git@github.com:dpb587/fort-knox.git fort-knox && cd !$
gitcrypt init # https://github.com/shadowhand/git-encrypt
git reset --hard HEAD
{% endhighlight %}
So, my first commit looks like:
{% highlight console %}
$ git add .
$ git commit -m 'initial commit'
[master (root-commit) 1077d71] initial commit
2 files changed, 7 insertions(+)
create mode 100644 .gitattributes
create mode 100644 README
{% endhighlight %}
## Under the Hood
Originally I was a bit curious and wanted to verify that it's doing what I thought. So I created a simple test file:
{% highlight console %}
$ date > top-secret.txt
$ cat top-secret.txt
Mon Jan 7 15:11:22 MST 2013
$ git add top-secret.txt
$ git commit -m 'top secret information'
[master dd2272a] top secret information
1 file changed, 1 insertion(+)
create mode 100644 top-secret.txt
{% endhighlight %}
After committing I can look at the raw index data to see what's actually being stored:
{% highlight console %}
$ git ls-tree HEAD
100644 blob 6a9e000e136a20858f65188f849d0bffed48a685 .gitattributes
100644 blob 2221766ff8694dffa1e11ea5d0e7acd213e22d90 README
100644 blob e847f7c05236ac1111a0f5495da87fec188d5420 top-secret.txt
$ git cat-file -p 2221766ff8694dffa1e11ea5d0e7acd213e22d90
Remember...
git clone git@github.com:dpb587/fort-knox.git fort-knox && cd !$
gitcrypt init # https://github.com/shadowhand/git-encrypt
git reset --hard HEAD
$ git cat-file -p e847f7c05236ac1111a0f5495da87fec188d5420
U2FsdGVkX199n2zBUSqitTy46rTQ8tytPxnYmmdBahPCL5u1SwnPcYcDN+KFNgom
{% endhighlight %}
As expected, `README` is readable, but `top-secret.txt` is not. I can manually verify my secret data is still there by
decoding it with my key:
{% highlight console %}
$ git cat-file -p e847f7c05236ac1111a0f5495da87fec188d5420 | openssl base64 -d -aes-256-ecb -k "EAC8405A-DD64-43A3-A17F-EB28195B4B1E"
Mon Jan 7 15:11:22 MST 2013
{% endhighlight %}
## Summary
With `gitcrypt` I can work with a repository and enjoy extra security on top of the redundancy and version control that
`git` provides. The only difference from my regular repos is I can't really view my files from [github.com][1] (with the
convenient exception of `README`).
[1]: https://github.com/
[2]: https://gist.github.com/873637
[3]: https://github.com/shadowhand/git-encrypt
[4]: http://git-scm.com/book/ch7-2.html#Keyword-Expansion


@@ -1,158 +0,0 @@
---
title: Terminating Gearman Workers in PHP
layout: post
tags: [ 'deploy', 'gearman', 'pcntl', 'php' ]
description: Locally and remotely stopping workers without interrupting jobs.
code: https://gist.github.com/dpb587/4531728
---
I use [Gearman][1] as a queue/job server. An application gives it a job to do, and Gearman passes the job along to a
worker that can finish it. Gearman handles both synchronous and asynchronous tasks, and the workers can be running
anywhere -- the same server as Gearman, a server across the country, or even a workstation at a local office.
This makes things a bit complicated when it comes time to push out software or configuration changes to workers. When
controlling workers locally, PHP's [gearman module][2] doesn't have a built-in way to terminate a worker without
possibly interrupting a running job. And by design, Gearman cannot broadcast a job to every worker, nor send a generic
job to a specific worker. I wanted a way to:
* ask a worker to stop in the middle of its task (standard `SIGINT`)
* ask a worker to stop after its current task
* remotely terminate a worker
* remotely terminate all workers
Even after a bit of [research][10] [and][11] [reading][12] [posts][13], there didn't seem to be an agreed-upon, fully
developed solution. So I took an afternoon to figure things out; the working result ended up in a [gist][gist], and
some of the background is below.
## Graceful Termination
For the first part, it was simply a matter of handling a `SIGTERM` signal with PHP's [pcntl module][3] and setting a
termination flag. The main worker loop could then check the flag every time it finished a job and cleanly exit. The
Gearman library complicated things a bit, though, because signals are not acknowledged while it's waiting for a job.
The workaround was to use its non-blocking alternative; although that still seemed to do some blocking, it was at
least for a configurable duration. Abbreviated, [worker.php][gist-worker.php] looks like:
{% highlight php %}{% raw %}
<?php declare(ticks = 1);

$terminate = false;

// flip the flag when SIGTERM arrives; the loop checks it between jobs
pcntl_signal(SIGTERM, function () use (&$terminate) { $terminate = true; });

$worker = new GearmanWorker();
$worker->addOptions(GEARMAN_WORKER_NON_BLOCKING);
$worker->setTimeout(2500);
$worker->addServer();
$worker->addFunction(...);

// non-blocking mode returns control regularly, so pending signals get handled
while ((!$terminate) && ($worker->work())) {
    $worker->wait();
}
{% endraw %}{% endhighlight %}
When sent a `SIGTERM` while running a job, it would wait to finish before exiting:
{% highlight console %}{% raw %}
$ (php worker.php test1 &)
[15:45:33] READY test1 (25244)
$ php queue.php sleep 20
[15:45:37] ASLEEP test1
$ kill -s TERM 25244
[15:45:39] SIGTERM test1
[15:45:57] AWAKE test1
[15:45:57] EXIT test1
{% endraw %}{% endhighlight %}
## Remote Termination
Sometimes it's easier to remotely terminate workers when they need new code or configuration (and let a process
manager restart them). Since Gearman doesn't support sending a job to every single worker, an alternative is to
register a terminate function for every worker (as mentioned in [this][5] response). Assuming every worker has a unique
identifier, this becomes trivial:
{% highlight php %}{% raw %}
$worker->addFunction(
    '_worker_' . $context['id'],
    function (GearmanJob $job) {
        if ('terminate' == $job->workload()) {
            posix_kill(getmypid(), SIGTERM);
        }
    }
);
{% endraw %}{% endhighlight %}
From the console, it looks like:
{% highlight console %}{% raw %}
$ (php worker.php test1 &)
[16:19:33] READY test1 (25372)
$ php queue.php _worker_test1 terminate
[16:19:38] SIGTERM test1
[16:19:38] EXIT test1
{% endraw %}{% endhighlight %}
## Batch Remote Termination
So now I can remotely terminate workers as needed. However, during deploys it's much more common to ask all of the
workers to restart. Using Gearman's [protocol][4] to find running workers, I can distribute the termination job and
then wait until every worker has received it. The result was [`terminate.php`][gist-terminate.php], which works
something like this:
{% highlight console %}{% raw %}
$ (php worker.php test1 &) ; (php worker.php test2 &) ; (php worker.php test3 &) ; (php worker.php test4 &)
[16:37:55] READY test1 (25479)
[16:37:55] READY test3 (25483)
[16:37:55] READY test2 (25481)
[16:37:55] READY test4 (25485)
$ php queue.php sleep 4 ; php queue.php sleep 8 ; php queue.php sleep 16
[16:37:57] ASLEEP test2
[16:37:57] ASLEEP test3
[16:37:57] ASLEEP test4
$ php terminate.php
[16:37:59] UP test4
[16:37:59] UP test3
[16:37:59] UP test2
[16:37:59] UP test1
[16:37:59] SIGTERM test1
[16:37:59] EXIT test1
[16:37:59] DOWN test1
[16:38:01] AWAKE test2
[16:38:01] SIGTERM test2
[16:38:01] EXIT test2
[16:38:01] DOWN test2
[16:38:05] AWAKE test3
[16:38:05] SIGTERM test3
[16:38:05] EXIT test3
[16:38:05] DOWN test3
[16:38:08] waiting for: test4
[16:38:13] AWAKE test4
[16:38:13] SIGTERM test4
[16:38:13] EXIT test4
[16:38:13] DOWN test4
{% endraw %}{% endhighlight %}
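At its core, the script boils down to something like the following sketch. The actual
[`terminate.php`][gist-terminate.php] in the gist is more involved; this assumes gearmand is local on its default
admin port (4730) and that workers register `_worker_<id>` functions as shown above.
{% highlight php %}{% raw %}
<?php
// ask gearmand which workers are connected and what functions they registered
function listWorkerIds($admin)
{
    $ids = array();

    fwrite($admin, "workers\n");

    // response lines look like "FD IP-ADDRESS CLIENT-ID : function1 function2 ..."
    // and the listing is terminated by a line containing only "."
    while ((false !== ($line = fgets($admin))) && ('.' != trim($line))) {
        if (preg_match_all('/\b_worker_(\S+)/', $line, $matches)) {
            $ids = array_merge($ids, $matches[1]);
        }
    }

    return array_unique($ids);
}

$admin = fsockopen('127.0.0.1', 4730);

$client = new GearmanClient();
$client->addServer();

// queue a terminate job for every worker-specific function we can see
foreach (listWorkerIds($admin) as $id) {
    $client->doBackground('_worker_' . $id, 'terminate');
}

// wait until every worker has finished its current job and dropped off the server
while ($remaining = listWorkerIds($admin)) {
    printf("waiting for: %s\n", implode(', ', $remaining));
    sleep(5);
}
{% endraw %}{% endhighlight %}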
## Summary
The result is an extra bit of code, but it makes automating tasks, especially around deploys, much easier. This really
just demonstrates one method of creating an internal worker API -- termination is only one possibility. Other, more
complex possibilities could be self-updating workers, lighter config reloads (instead of full restarts), or
dynamically registering/unregistering functions depending on application load.
[gist]: https://gist.github.com/dpb587/4531728
[gist-terminate.php]: https://gist.github.com/dpb587/4531728#file-terminate-php
[gist-worker.php]: https://gist.github.com/dpb587/4531728#file-worker-php
[1]: http://gearman.org/
[2]: http://php.net/manual/en/book.gearman.php
[3]: http://php.net/manual/en/book.pcntl.php
[4]: http://gearman.org/protocol
[5]: http://stackoverflow.com/questions/7663922/gearman-using-php-possible-to-send-job-message-to-all-workers/7664139#7664139
[10]: http://gearman.org/php_reference
[11]: https://groups.google.com/forum/?fromgroups=#!topic/gearman/ST6Ikw7__kY
[12]: http://stackoverflow.com/questions/2270323/stopping-gearman-workers-nicely
[13]: http://stackoverflow.com/questions/7663922/gearman-using-php-possible-to-send-job-message-to-all-workers


@@ -1,64 +0,0 @@
---
title: OpenGrok CLI
layout: post
tags: [ 'opengrok', 'php', 'symfony', 'xpath' ]
description: Making it easier to search code from the command line.
code: https://github.com/dpb587/opengrok-cli
---
One tool that makes my life as a software developer easier is [OpenGrok][1] - it lets me quickly find application code
and it knows more context than a simple `grep`. It has a built-in web interface, but sometimes I want to work with
search results from the command line (particularly for automated tasks). Since I couldn't find an API, I created a
command to load and parse results using [symfony/console][3] and [xpath][4].
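The parsing itself is nothing exotic. Here is a rough sketch of the idea -- the search URL and result markup below are
assumptions on my part (they vary by OpenGrok version and deployment), and the real command wraps this up in a
symfony/console command:
{% highlight php %}
<?php
// rough sketch: fetch a search results page and pull out the hits with xpath
$server = 'http://lxr.php.net';
$project = 'PHP_5_4';
$query = 'oci_internal_debug';

$html = file_get_contents(
    $server . '/search?project=' . urlencode($project) . '&q=' . urlencode($query)
);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the results page is HTML, not strict XML
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// the markup varies by OpenGrok version; here each hit is assumed to be a link
// inside the results container, with the file and line number in its href
foreach ($xpath->query('//div[@id="results"]//a') as $link) {
    echo $link->getAttribute('href'), ': ', trim($link->textContent), PHP_EOL;
}
{% endhighlight %}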
## Usage
It's straightforward to use: just provide the OpenGrok server, the project to search, and the query. Mimicking `grep`,
the output format should look familiar:
{% highlight console %}
$ opengrok-cli --server=http://lxr.php.net --project=PHP_5_4 oci_internal_debug
/ext/oci8/oci8.c:777: PHP_FUNCTION(oci_internal_debug);
/ext/oci8/oci8.c:862: PHP_FE(oci_internal_debug, arginfo_oci_internal_debug)
/ext/oci8/oci8.c:932: PHP_FALIAS(ociinternaldebug, oci_internal_debug, arginfo_oci_internal_debug)
/ext/oci8/oci8_interface.c:1307: /* {{ "{{{" }} proto void oci_internal_debug(int onoff)
/ext/oci8/oci8_interface.c:1309: PHP_FUNCTION(oci_internal_debug)
{% endhighlight %}
When run from an ANSI-friendly terminal, the output is nicely colorized. And just like the web interface, the `query`
argument can include operators, nested queries, field specifiers, and wildcard searches.
It also has a `--list` option to output only paths. This is useful when I'm at the repository's top level and want to
work through all the results with `vim`:
{% highlight console %}
$ cd php-src/
$ export OPENGROK_SERVER=http://lxr.php.net OPENGROK_PROJECT=PHP_5_4
$ vim $(opengrok-cli --list refs:PHP_MODE_PROCESS_STDIN)
{% endhighlight %}
## Open Source
I published the code to [dpb587/opengrok-cli][5]. Check the `README`, but it's easy to get started:
{% highlight console %}
$ git clone git://github.com/dpb587/opengrok-cli.git opengrok-cli && cd !$
$ php composer.phar install
$ ./bin/opengrok-cli --help
{% endhighlight %}
Or take the easier route and use the pre-compiled version:
{% highlight console %}
$ wget static.dpb587.me/opengrok-cli.phar
$ php opengrok-cli.phar --help
{% endhighlight %}
[1]: http://hub.opensolaris.org/bin/view/Project+opengrok/
[2]: http://ctags.sourceforge.net/
[3]: http://symfony.com/doc/master/components/console/introduction.html
[4]: http://us.php.net/domxpath
[5]: https://github.com/dpb587/opengrok-cli


@@ -1,39 +0,0 @@
---
title: Scripting Endicia to Purchase Postage
layout: post
tags: [ 'applescript', 'endicia', 'loopy' ]
description: Automating user interactions with AppleScript.
code: https://gist.github.com/dpb587/4660132
---
We currently use [Endicia for Mac][1] for postage processing at Loopy. We rarely use the UI since I've scripted most of
it, but one annoyance had been having to regularly open it up and add postage, since the balance doesn't refill
automatically. If we happen to forget, it ends up blocking things until we notice. I finally got around to scripting
that, too.
## Scripted
In real life, whenever the balance gets too low the application throws up an alert, and you need to click through a few
menus, select a purchase amount, and confirm the selection before it will continue. With [System Events][2], all of
that can be conveniently automated. Using [the script][4] I wrote, $500 of postage can be purchased by running:
{% highlight console %}
$ osascript endicia-postage-purchase.applescript 500
ok
{% endhighlight %}
With that step automated, it can be tied in with the `endiciatool` output -- whenever `<Balance />` drops below $30,
automatically kick off the script to buy more postage.
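A minimal sketch of that glue, assuming the `endiciatool` XML output has been saved to `status.xml` and that
`<Balance />` sits at the top level of it (both assumptions here):
{% highlight php %}
<?php
// sketch: top up postage whenever the reported balance runs low
$status = simplexml_load_file('status.xml');
$balance = (float) $status->Balance;

if ($balance < 30) {
    // the AppleScript from the gist takes the purchase amount as its argument
    passthru('osascript endicia-postage-purchase.applescript 500');
}
{% endhighlight %}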
## Summary
So now that's one less manual step everybody has to worry about, saving some time and hassle. If you happen to be new to
[Endicia][3], you should check them out (and use the promotional code <code>599888</code>). Their software has been a
valuable timesaver for us.
[1]: http://www.dymoendicia.com/segments/all-products/endicia-for-mac
[2]: https://developer.apple.com/library/mac/#documentation/applescript/conceptual/applescriptx/Concepts/as_related_apps.html#//apple_ref/doc/uid/TP40001570-1149074-BAJEIHJA
[3]: http://www.dymoendicia.com/
[4]: https://gist.github.com/dpb587/4660132#file-endicia-purchase-postage-applescript


@@ -1,173 +0,0 @@
---
title: Automating Backups to the Cloud
layout: post
tags: [ 'backup', 'gpg', 's3' ]
description: Combining gpg, Amazon S3 and IAM policies.
---
Backups are extremely important and I've been experimenting with a few different methods. My concerns are always focused
on maintaining data integrity, security, and availability. One of my current methods involves using asymmetric keys for
secure storage and object versioning to ensure backup data can't be silently overwritten.
## Encryption Keys
For encryption and decryption I'm using asymmetric keys via [`gpg`][1]. This way, any server can generate and encrypt
the data, but only administrators who have the private key can actually decrypt it. Generating the administrative key
looks like:
{% highlight console %}{% raw %}
$ gpg --gen-key
gpg (GnuPG) 1.4.11; Copyright (C) 2010 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
... [snip] ...
gpg: key CEFAF45B marked as ultimately trusted
public and secret key created and signed.
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0 valid: 4 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 4u
pub 2048R/CEFAF45B 2013-02-08
Key fingerprint = 46DF 2951 7E2D 41D7 F7B5 EB16 20C2 1C03 CEFA F45B
uid Danny Berger (secret-project-backup) <dpb587@gmail.com>
sub 2048R/765C4556 2013-02-08
{% endraw %}{% endhighlight %}
To actually use the public key on servers, it can be exported and copied...
{% highlight console %}{% raw %}
$ gpg --armor --export 'Danny Berger (secret-project-backup) <dpb587@gmail.com>'
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.11 (Darwin)
... [snip] ...
-----END PGP PUBLIC KEY BLOCK-----
{% endraw %}{% endhighlight %}
Then pasted and imported on the machine(s) that will be encrypting data...
{% highlight console %}{% raw %}
$ cat | gpg --import
gpg: directory `/home/app-devtools/.gnupg' created
gpg: new configuration file `/home/app-devtools/.gnupg/gpg.conf' created
gpg: WARNING: options in `/home/app-devtools/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/home/app-devtools/.gnupg/secring.gpg' created
gpg: keyring `/home/app-devtools/.gnupg/pubring.gpg' created
... [snip] ... Ctrl-D ...
gpg: /home/app-devtools/.gnupg/trustdb.gpg: trustdb created
gpg: key CEFAF45B: public key "Danny Berger (secret-project-backup) <dpb587@gmail.com>" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
{% endraw %}{% endhighlight %}
And then marked as "ultimately trusted" with the `trust` command (otherwise it always wants to confirm before using the
key)...
{% highlight console %}{% raw %}
$ gpg --edit-key 'Danny Berger (secret-project-backup) <dpb587@gmail.com>'
... [snip] ...
pub 2048R/CEFAF45B created: 2013-02-08 expires: never usage: SC
trust: ultimate validity: unknown
sub 2048R/765C4556 created: 2013-02-08 expires: never usage: E
[ unknown] (1). Danny Berger (secret-project-backup) <dpb587@gmail.com>
Please note that the shown key validity is not necessarily correct
unless you restart the program.
Command> quit
{% endraw %}{% endhighlight %}
## Amazon S3
In my case, I wanted to regularly send the encrypted backups offsite, and [S3][2] seemed like a flexible, effective
place to store them. This involved a couple of steps:
**Create a new S3 bucket** (e.g. `backup.secret-project.example.com`) - this will just hold all the different backup
types and files for the project.
**Enable Object Versioning** on the S3 bucket - whenever a new backup gets dropped off, previous backups will remain.
This provides additional security (e.g. a compromised server could not overwrite a backup with an empty file) and
allows for more flexible retention policies than Amazon's Glacier lifecycle rules.
**Create a new IAM user** (e.g. `com-example-secret-project-backup`) - the user and its Access Key will be responsible
for uploading the backup files to the bucket.
**Add a User Policy** to the IAM user - the only permission it needs is `PutObject` for the bucket:
{% highlight javascript %}
{
    "Statement" : [
        {
            "Sid" : "Stmt0123456789",
            "Action" : [
                "s3:PutObject"
            ],
            "Effect" : "Allow",
            "Resource" : [
                "arn:aws:s3:::backup.secret-project.example.com/*"
            ]
        }
    ]
}
{% endhighlight %}
**Upload Method** - instead of depending on third-party libraries for uploading the backup files, I wanted to try simply
using `curl` with Amazon S3's Browser-Based Upload functionality. This involved creating and signing the appropriate
policy via the [sample][3] policy builder for a particular backup type. My simple policy looked like:
{% highlight javascript %}
{
    "expiration" : "2016-01-01T12:00:00.000Z",
    "conditions" : [
        { "bucket" : "backup.secret-project.example.com" },
        { "acl" : "private" },
        { "key" : "database.sql.gz.enc" }
    ]
}
{% endhighlight %}
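The policy is then base64-encoded and signed with the IAM user's secret access key -- the same thing Amazon's sample
page does in the browser. A minimal PHP equivalent, assuming the JSON above was saved as `policy.json` and `$secretKey`
holds the secret key:
{% highlight php %}
<?php
// sketch: compute the signed policy the browser-based upload expects
$policy = base64_encode(file_get_contents('policy.json'));
$signature = base64_encode(hash_hmac('sha1', $policy, $secretKey, true));

// these two values become the policy and signature fields of the upload form
echo $policy, PHP_EOL, $signature, PHP_EOL;
{% endhighlight %}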
## All Together
Putting everything together, a single command could be used to backup the database, compress, encrypt, and upload:
{% highlight console %}
$ mysqldump ... \
| gzip -c \
| gpg --recipient 'Danny Berger (secret-project-backup) <dpb587@gmail.com>' --encrypt \
| curl \
-F key=database.sql.gz.enc \
-F acl=private \
-F AWSAccessKeyId=AKIA99076E3F28E55AF85 \
-F policy=ewogICJleHBpcmF0aW9uIiA6ICIyMDE2LTAxLTAxVDEyOjAwOjAwLjAwMFoiLAogICJjb25kaXRpb25zIiA6IFsKICAgIHsgImJ1Y2tldCIgOiAiYmFja3VwLnNlY3JldC1wcm9qZWN0LmV4YW1wbGUuY29tIiB9LAogICAgeyAiYWNsIiA6ICJwcml2YXRlIiB9LAogICAgeyAia2V5IiA6ICJkYXRhYmFzZS5zcWwuZ3ouZW5jIiB9LAogIF0KfQ== \
-F signature=937ca778e4d44db7b804cfdd70d= \
-F file=@- \
https://s3.amazonaws.com/backup.secret-project.example.com
{% endhighlight %}
And then to download, decrypt, decompress, and reload the database from an administrative machine:
{% highlight console %}
$ wget -qO- 'https://s3.amazonaws.com/backup.secret-project.example.com/database.sql.gz.enc?versionId=c0b55912f42c4142bcb44c3eb1376f35&AWSAccessKeyId=AKIA99076E3F28E55AF85&...' \
| gpg -d \
| gunzip \
| mysql ...
{% endhighlight %}
The only task remaining is creating a cleanup script using the S3 API to monitor the different backup versions and
delete them as they expire.
## Summary
While it takes a bit of overhead to get things set up properly, `gpg` makes secure backups trivial and S3 provides a
flexible storage strategy for keeping the data safe.
[1]: http://www.gnupg.org/
[2]: http://aws.amazon.com/s3/
[3]: http://s3.amazonaws.com/doc/s3-example-code/post/post_sample.html


@@ -1,62 +0,0 @@
---
title: Using Facter in Ant Scripts
layout: post
tags: [ 'ant', 'facter' ]
description: Reusing facts from build scripts.
---
After using [puppet][1] for a while, I have become used to some of the facts that [facter][2] automatically provides.
When working with [ant][3] build scripts, I started wishing I didn't have to generate similar facts myself through
various `exec` calls.
## One Fact
Instead of fragile lookups like...
{% highlight xml %}
<exec executable="/bin/bash" outputproperty="lookup.eth0">
<arg value="-c" />
<arg value="/sbin/ifconfig eth0 | grep 'inet addr' | awk -F: '{print $2}' | awk '{print $1}'" />
</exec>
{% endhighlight %}
I can simplify it with...
{% highlight xml %}
<exec executable="/usr/bin/facter" outputproperty="lookup.eth0">
<arg value="ipaddress_eth0" />
</exec>
{% endhighlight %}
## In Bulk
Or I can load all facts with...
{% highlight xml %}
<tempfile property="tmp.facter.properties" deleteonexit="true" />
<exec executable="/bin/bash" output="${tmp.facter.properties}" failonerror="true">
    <arg value="-c" />
    <arg value="/usr/bin/facter -p | /bin/sed -e 's/ => /=/'" />
</exec>
<property file="${tmp.facter.properties}" prefix="facter" />
{% endhighlight %}
And reference a fact in my task...
{% highlight xml %}
<exec executable="${basedir}/bin/configure-env">
<arg value="--set-listen" />
<arg value="${facter.ipaddress_eth0}" />
</exec>
{% endhighlight %}
## Summary
So now it's much easier to reference environment information from property files (via interpolation), in conditional
targets, and, of course, within actual tasks.
[1]: https://puppetlabs.com/puppet/what-is-puppet/
[2]: https://puppetlabs.com/puppet/related-projects/facter/
[3]: http://ant.apache.org/


@@ -1,122 +0,0 @@
---
title: A Generic Storage Interface
layout: post
tags: [ 'asset', 'php', 'storage' ]
description: Abstracting file storage, whether it's local or cloud.
---
Websites often have a lot of different assets and files for their various areas - content management systems, photo
galleries, e-commerce product photos, etc. As a site grows, so do its storage and backup requirements, and as storage
demands grow it typically becomes necessary to distribute those files across multiple servers or services.
One method for managing disparate file systems is to use custom PHP [stream wrappers][4] and configurable paths, but
some extensions don't yet support custom wrappers for file access. An alternative that I've been using is an object-
and service-oriented approach that keeps my application code independent from the storage configuration.
## Interface
At the core of my design is the asset storage interface, which looks something like:
{% highlight php %}
<?php
interface StorageEngineInterface
{
    // store a file and return back a token that can be used to retrieve it
    function store(SplFileInfo $file);

    // retrieve a locally-accessible SplFileInfo based on the token
    function retrieve($token);

    // remove data from storage based on the token
    function purge($token);
}
{% endhighlight %}
The storage engine is responsible for generating a reusable token that can be used for later retrieval. Generally I
simply have it generate a UUID as the token; however, tokens could have storage-specific meaning.
## Sample Storage Engines
I've used several base implementations:
* `LocalStorageEngine` - the simplest storage, using a local/NFS filesystem (see the sketch after this list)
* `AWSS3StorageEngine` - using [AWS S3][1] for storage
* `SftpStorageEngine` - using PHP's [ssh2][2] module to access files on servers via SFTP
* `AtlassianConfluenceStorageEngine` - managing documents within [Confluence][3] wikis
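For illustration, the local engine can be tiny. This is only a sketch, not the actual class: it assumes the target
directory already exists and uses `uniqid` in place of a real UUID generator.
{% highlight php %}
<?php
class LocalStorageEngine implements StorageEngineInterface
{
    private $directory;

    public function __construct($directory)
    {
        $this->directory = rtrim($directory, '/');
    }

    public function store(SplFileInfo $file)
    {
        // the token doubles as the filename within the storage directory
        $token = uniqid('', true);
        copy($file->getPathname(), $this->directory . '/' . $token);

        return $token;
    }

    public function retrieve($token)
    {
        return new SplFileInfo($this->directory . '/' . $token);
    }

    public function purge($token)
    {
        unlink($this->directory . '/' . $token);
    }
}
{% endhighlight %}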
Remote services like AWS S3 and SFTP can cause significant performance issues. To help with that, I use a
`CachedStorageEngine` implementation. It accepts two `StorageEngineInterface` arguments: one as the upstream engine, and
one as the local cache. For example:
{% highlight php %}
<?php
new CachedStorageEngine(
    new AWSS3StorageEngine(new Aws\S3\S3Client(...), 'bucket.example.com', 'my-prefix'),
    new LocalStorageEngine('/tmp/s3-bucket.example.com-cache')
);
{% endhighlight %}
And since `CachedStorageEngine` is just another implementation of `StorageEngineInterface`, it can be used
interchangeably within the application with performance being the only difference.
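The caching wrapper itself stays fairly small. A rough sketch follows; it keeps the upstream-to-local token map in
memory, which a real implementation would need to persist somewhere.
{% highlight php %}
<?php
class CachedStorageEngine implements StorageEngineInterface
{
    private $upstream;
    private $cache;
    private $map = array();

    public function __construct(StorageEngineInterface $upstream, StorageEngineInterface $cache)
    {
        $this->upstream = $upstream;
        $this->cache = $cache;
    }

    public function store(SplFileInfo $file)
    {
        // the upstream token remains the canonical one handed back to the application
        return $this->upstream->store($file);
    }

    public function retrieve($token)
    {
        if (!isset($this->map[$token])) {
            // cache miss: pull from upstream once, then keep a local copy
            $this->map[$token] = $this->cache->store($this->upstream->retrieve($token));
        }

        return $this->cache->retrieve($this->map[$token]);
    }

    public function purge($token)
    {
        if (isset($this->map[$token])) {
            $this->cache->purge($this->map[$token]);
            unset($this->map[$token]);
        }

        $this->upstream->purge($token);
    }
}
{% endhighlight %}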
## Application Usage
Using dependency injection, each of the storage backends becomes an independent service, configured according to the
application's requirements. The application then has no storage-specific calls like `copy`, `file_get_contents`,
`fopen`, etc., and the code looks something like:
{% highlight php %}
<?php
// storage service for photos
$storage = $dic->get('photo_storage');

// save a new photo
$photo = new PhotoRecord();
$photo->setAssetToken(
    $storage->store($request->files->get('upload'))
);

// use the photo
$image = (new Imagine\Gd\Imagine())->open(
    $storage->retrieve($photo->getAssetToken())
);

// delete the photo
$storage->purge($photo->getAssetToken());
$photo->delete();
{% endhighlight %}
Since `retrieve` will always return a [`SplFileInfo`][5] instance, it can be referenced and handled like a local file
(as demonstrated by the `open` call in the example above).
## Complicating Things
The asset storage interface itself is fairly primitive, but it allows for some more complex configurations:
* by using dependency injection, it becomes extremely easy to switch storage engines, since application code doesn't
need to change
* complex storage rules can be combined with meaningful tokens to, for example, store very large files on different
disks and use a token prefix to identify that class of file
* a fallback storage class can work through a chain of storages until it's able to store or retrieve a token
* operations can be deferred internally via a queue manager (e.g. instead of storing files immediately to S3 and
waiting on the upload, write them locally and create a job to upload them in the background)
## Summary
Abstracting storage logic out of my application code makes my life much easier, both as a developer and as a systems
administrator, when it comes to managing where files are located and relocating them as necessary.
[1]: http://aws.amazon.com/s3/
[2]: http://www.php.net/manual/en/book.ssh2.php
[3]: http://atlassian.com/software/confluence/overview/team-collaboration-software
[4]: http://www.php.net/manual/en/class.streamwrapper.php
[5]: http://us.php.net/manual/en/class.splfileinfo.php


@@ -1,32 +0,0 @@
---
title: Path-based tmpfile in PHP
layout: post
tags: [ 'php' ]
description: When paths are more useful than resources.
---
PHP has the [`tmpfile`][1] function for creating a temporary file which is automatically removed when its handle is
closed or when the script ends. PHP also has the [`tempnam`][2] function, which takes care of creating the file and
returning its path, but doesn't automatically destroy the file.
To get the best of both worlds (a temp file path + auto-destroy), I have found this useful:
{% highlight php %}
<?php
function tmpfilepath() {
    $handle = tmpfile();
    $path = stream_get_meta_data($handle)['uri'];

    register_shutdown_function(
        // capturing $handle keeps the temporary file alive until shutdown;
        // without a reference, PHP would close the handle (and remove the
        // file) as soon as this function returns
        function () use ($handle) {
            fclose($handle);
        }
    );

    return $path;
}
{% endhighlight %}
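Usage then looks like any other path-based file work:
{% highlight php %}
<?php
// the returned path behaves like any other writable file until the script exits
$path = tmpfilepath();
file_put_contents($path, 'scratch data');
echo file_get_contents($path); // "scratch data"
{% endhighlight %}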
[1]: http://php.net/manual/en/function.tmpfile.php
[2]: http://php.net/manual/en/function.tempnam.php


@@ -1,181 +0,0 @@
---
title: Comparing PHP Application Definitions
layout: post
tags: [ 'code', 'diff', 'language', 'php', 'xslt' ]
description: Identifying how classes/interfaces changed between versions.
code: https://github.com/dpb587/diff-defn.php
---
While working to update a PHP project, I thought it'd be helpful if I could systematically characterize significant
code changes between versions. I could weed through a massive line diff, but that's costly if many of the changes don't
ultimately affect my API dependencies. Typically I only care about how interfaces and classes change in their usage of
methods, method arguments, variables, and scope.
I did a bit of research on the idea and found [several][7] [different][8] [questions][9], a few [referenced][10]
[products][11], and a short [article][12]. However, I wasn't able to find a good PHP (or even generic) option which was
open source and something I could easily try out.
To that end, I made a prototype for a language-aware, structured ("OOP diff") engine that can summarize many of the
programmatic changes in an easily readable report, linking definitions back to their file and line number for more
detailed review...
<img alt="symfony/console example" height="343" src="{{ site.asset_prefix }}/blog/2013-03-07-comparing-php-application-definitions/console-diff.png" width="536" />
## Usage
If I were upgrading my application with a [`symfony/Console`][1] dependency from `v2.0.22` to `v2.2.0`, I could generate
the diff of definitions with:
{% highlight console %}
$ git clone git://github.com/dpb587/diff-defn.php.git diff-defn.php && cd !$
$ php composer.phar install
$ ./bin/diff-defn.php diff:show --exclude='/Tests/' git://github.com/symfony/Console.git v2.0.22 v2.2.0 > output.html
$ open output.html
{% endhighlight %}
Take a look at several other reports using the default stylesheet:
* [`doctrine/dbal`][2] (`2.1.7` &rarr; `2.3.2`)
* [`fabpot/Twig`][3] (`v1.10.0` &rarr; `v1.12.2`)
* [`symfony/symfony`][4] (`v2.0.22` &rarr; `v2.2.0`)
* [`zendframework/zf2`][5] (`release-2.0.0` &rarr; `release-2.1.3`)
## Behind the Scenes
The logic behind the command looks like:
1. Use version control to diff the two versions and see what files were changed.
2. Use [nikic/php-parser][6] to parse the PHP files in both their initial and final commits (see the sketch after this
list)...
   * Build separate structures for both the initial and final code states.
   * Use visitors to analyze definitions, both language-level and application-specific.
3. Use some logic to compare the initial and final structures and create a new structured diff containing only the
relevant definitions that changed (including both old and new values).
4. Apply a stylesheet to the diff structure to generate human-readable output.
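As a minimal sketch of the parse-and-visit step -- written here against nikic/php-parser 4.x, which is newer than what
the prototype used, so the API differs -- this just prints class and method definitions from one file:
{% highlight php %}
<?php
use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

require 'vendor/autoload.php';

$parser = (new ParserFactory())->create(ParserFactory::PREFER_PHP7);
$stmts = $parser->parse(file_get_contents($argv[1]));

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node)
    {
        if ($node instanceof Node\Stmt\Class_) {
            echo $node->name, PHP_EOL;
        } elseif ($node instanceof Node\Stmt\ClassMethod) {
            // a real visitor would record visibility, params, defaults, typehints, ...
            echo '  ', $node->name, ' (', count($node->getParams()), ' params)', PHP_EOL;
        }
    }
});
$traverser->traverse($stmts);
{% endhighlight %}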
The structures are simple classes which can be dumped to XML. Technically, aside from the PHP-parsing step, this is all
very language-agnostic. For example, the XML representation of the initial or final commit looks like:
{% highlight xml %}
<root id="root">
<defn id="source" repository="git://github.com/symfony/Security.git" repository-link="https://github.com/symfony/Security/" file-link="https://github.com/symfony/Security/blob/%commit%/%file%#L%line%" commit-link="https://github.com/symfony/Security/tree/%commit%">
<defn id="commit" value="8cd00e30f4a13b0c57c5d98613c3dd533bc1c35a" friendly="v2.0.22"/>
</defn>
<class id="Symfony\Component\Security\Http\Firewall\UsernamePasswordFormAuthenticationListener">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="33"/>
<class-extends id="Symfony\Component\Security\Http\Firewall\AbstractAuthenticationListener"/>
<class-property id="csrfProvider">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="35"/>
<defn-attr id="visibility" value="private"/>
</class-property>
<function id="__construct">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="40"/>
<defn-attr id="visibility" value="public"/>
<function-param id="securityContext">
<defn-attr id="typehint" value="Symfony\Component\Security\Core\SecurityContextInterface"/>
</function-param>
<!-- ... -->
<function-param id="providerKey"/>
<function-param id="options">
<defn-attr id="default" type="array" value="[]"/>
<defn-attr id="typehint" value="array"/>
</function-param>
<!-- ... -->
<function-param id="logger">
<defn-attr id="default" type="const" value="null"/>
<defn-attr id="typehint" value="Symfony\Component\HttpKernel\Log\LoggerInterface"/>
</function-param>
<!-- ... -->
</function>
<function id="attemptAuthentication">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="56"/>
<defn-attr id="visibility" value="protected"/>
<function-param id="request">
<defn-attr id="typehint" value="Symfony\Component\HttpFoundation\Request"/>
</function-param>
</function>
</class>
</root>
{% endhighlight %}
And after the initial and final commits are compared, the resulting structured diff looks like:
{% highlight xml %}
<root id="root" diff="touched">
<defn id="source" repository="git://github.com/symfony/Security.git" repository-link="https://github.com/symfony/Security/" file-link="https://github.com/symfony/Security/blob/%commit%/%file%#L%line%" commit-link="https://github.com/symfony/Security/tree/%commit%" diff="touched">
<defn id="commit" value="9e53793548e403c155d28a01153026905ee53d5d" friendly="v2.2.0" diff="changed">
<diff-old id="old">
<defn id="commit" value="8cd00e30f4a13b0c57c5d98613c3dd533bc1c35a" friendly="v2.0.22"/>
</diff-old>
</defn>
</defn>
<class id="Symfony\Component\Security\Http\Firewall\UsernamePasswordFormAuthenticationListener" diff="touched">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="33"/>
<function id="__construct" diff="touched">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="40"/>
<function-param id="logger" diff="touched">
<defn-attr id="typehint" value="Psr\Log\LoggerInterface" diff="changed">
<diff-old id="old">
<defn-attr id="typehint" value="Symfony\Component\HttpKernel\Log\LoggerInterface"/>
</diff-old>
</defn-attr>
</function-param>
</function>
<function id="requiresAuthentication" diff="added">
<defn-source id="source" file="Http/Firewall/UsernamePasswordFormAuthenticationListener.php" line="56" diff="added"/>
<defn-attr id="visibility" value="protected" diff="added"/>
<function-param id="request" diff="added">
<defn-attr id="typehint" value="Symfony\Component\HttpFoundation\Request"/>
</function-param>
</function>
</class>
</root>
{% endhighlight %}
## Going Further
Being able to parse files and have their differences stored in a static, semi-agnostic format allows for some
interesting uses:
* searching for specific changes, like which class methods have had a typehint changed (e.g. the xpath
`//class/function/function-param/defn-attr[@id="typehint" and @diff="changed" and @value="Psr\Log\LoggerInterface"]`;
see the sketch after this list)
* combining search results with other automated tools for updating impacted application code, or explicitly requiring
reviews for changes that break compatibility standards
* generating lists of new interfaces/classes, dropped definitions, and newly limited scopes
* when using test naming conventions, specifically verifying that the changed code's tests are run
* tracking classes/methods/functions instead of simple "lines of code" stats
* writing post-commit rules based on definition searches (e.g. email a maintainer whenever a critical class is touched)
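For example, the typehint search from the first bullet is only a few lines of xpath against the saved diff. This
assumes the structured diff was written out to `diff.xml`; element names follow the example above.
{% highlight php %}
<?php
// sketch: query the serialized structured diff for a specific kind of change
$doc = new DOMDocument();
$doc->load('diff.xml');

$xpath = new DOMXPath($doc);
$query = '//class/function/function-param/defn-attr'
    . '[@id="typehint" and @diff="changed" and @value="Psr\Log\LoggerInterface"]';

foreach ($xpath->query($query) as $attr) {
    $param = $attr->parentNode;          // <function-param>
    $function = $param->parentNode;      // <function>
    $class = $function->parentNode;      // <class>

    echo $class->getAttribute('id'), '::', $function->getAttribute('id'),
        '($', $param->getAttribute('id'), ')', PHP_EOL;
}
{% endhighlight %}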
Since the analysis and the serialized, static representation are distinct steps, this also allows for custom,
application-specific analysis information like:
* in aspect-oriented code, analyzing `@Aspects(...)` and including them in reports
* tying code-linting tool results to flag specific methods/properties that have issues
* additional flags to monitor if function logic changed vs formatting/comments (even if the API is unchanged)
And unlike some of the other tools I ran into, the static representation is not itself inherently readable; it needs a
stylesheet to make it human-friendly. This makes the results potentially reusable for multiple different reports.
## Summary
I've published this work-in-progress code to [dpb587/diff-defn.php][13] in case you want to try it out with your own PHP
repositories. It's certainly not a replacement for reading changelogs and understanding what upstream changes are being
made, but I have found it interesting and helpful for identifying breaking changes.
[1]: https://github.com/symfony/Console
[2]: http://static.dpb587.me/2013-03-07-comparing-php-application-definitions/doctrine-dbal-2.1.7..2.3.2.html
[3]: http://static.dpb587.me/2013-03-07-comparing-php-application-definitions/fabpot-Twig-v1.10.0..v1.12.2.html
[4]: http://static.dpb587.me/2013-03-07-comparing-php-application-definitions/symfony-symfony-v2.0.22..v2.2.0.html
[5]: http://static.dpb587.me/2013-03-07-comparing-php-application-definitions/zendframework-zf2-release-2.0.0..release-2.1.3.html
[6]: https://github.com/nikic/php-parser
[7]: http://stackoverflow.com/questions/77931/do-you-know-of-any-language-aware-diffing-tools
[8]: http://stackoverflow.com/questions/2828795/is-there-a-language-aware-diff
[9]: http://discuss.fogcreek.com/joelonsoftware5/default.asp?cmd=show&ixPost=155585&ixReplies=18
[10]: http://www.semdesigns.com/Products/SmartDifferencer/index.html
[11]: http://www.schneidersoft.com/Products/OOP-DIFF/OOP-DIFF.aspx
[12]: http://www.itworld.com/software/231515/usenix-dartmouth-expanding-diff-grep-unix-tools
[13]: https://github.com/dpb587/diff-defn.php


@@ -1,47 +0,0 @@
---
title: Using HTML Headers with wkhtmltopdf
layout: post
tags: [ 'headers', 'wkhtmltopdf' ]
description: Experimenting with dynamic HTML headers for PDFs.
---
Preparing for my job search, I really wanted to somehow reuse the content from my [about][2] page for my
r&#233;sum&#233; instead of also trying to maintain the information in a Word/Google Drive file. Mac OS X has the
convenient capability to print anything to a PDF, which was helpful while creating a general print-specific stylesheet
for browsers, but it still had a few drawbacks. One of those drawbacks is headers - I expect to see them on even the
simplest professional documents. Having used [`wkhtmltopdf`][1] before, I knew it could be a solution.
I started by creating a simple [header file][3] to include my name, my website, the document name, and page
information. I also created a new CSS class to take care of hiding the in-page header and footer, since they just take
up extra space and are being replaced. With a few extra arguments, `wkhtmltopdf` does a brilliant job of creating a
professional document:
    wkhtmltopdf \
        --print-media-type \
        --run-script 'document.body.className+=" alt-printarticle";' \
        --margin-left 8mm --margin-right 8mm --margin-top 20mm \
        --header-spacing 3 \
        --header-html 'http://localhost:4000/include/content/header-simple.html?doctitle=r%26%23233%3Bsum%26%23233%3B' \
        --title 'resume' \
        'http://localhost:4000/about.html' \
        resume.pdf
Once that was working, I applied a few other tricks to make the printout a bit nicer:
* [`page-break-inside`][5] &ndash; to prevent specific lines from breaking across pages (e.g. keeping the job title and
company lines together)
* `a` tag styling &ndash; suppressing underlines and visual differences since they make less sense when printed on
paper
* `.screen-only` and `.print-only` classes &ndash; to show slightly different content when printing (e.g. showing
company website addresses instead of a generically linked "website" that looks simpler on browser screens)
Finally, after a bit of experimenting, learning, and styling, I can now present a consistent r&#233;sum&#233; (and cover
letters, references, &hellip;) whether it's through a [PDF file][4] or a [web page][2]. When viewed as a PDF, it has the
added benefit of remaining interactive through embedded links.
[1]: https://code.google.com/p/wkhtmltopdf/
[2]: /about.html
[3]: https://github.com/dpb587/dpb587.me/blob/master/static/dev/content/header-simple.html
[4]: http://static.dpb587.me/about.pdf
[5]: https://developer.mozilla.org/en-US/docs/CSS/page-break-inside


@@ -1,41 +0,0 @@
---
title: Bank Card Readers for Web Applications
layout: post
tags: [ 'bank card', 'forms', 'javascript', 'reader' ]
description: Scanning credit cards into website forms.
code: https://gist.github.com/dpb587/5229239
---
I made a web-based [point of sale][1] for [The Loopy Ewe][2], but it needed an easier way to accept credit cards aside
from manually typing in the card details. To help with that, we got a keyboard-emulating USB magnetic card reader, and
I wrote a [parser][3] that reads the [card data][4] and converts it to an object. It is fairly simple to hook up to a
form and enable a card to be scanned while the user is focused in the name or number field...
{% highlight javascript %}
require(
    [ 'payment/form/cardparser', 'vendor/mootools' ],
    function (paymentFormCardparser) {
        function storeCard(card) {
            $('payment[card][name]').value = card.name;
            $('payment[card][number]').value = card.number;
            $('payment[card][expm]').value = card.expm;
            $('payment[card][expy]').value = card.expy;
            $('payment[card][code]').focus();
        }

        paymentFormCardparser
            .listen($('payment[card][name]'), storeCard)
            .listen($('payment[card][number]'), storeCard)
        ;
    }
);
{% endhighlight %}
It acts as a very passive listener without requiring the user to do anything special - if there is no card reader
connected then the form field is simply a regular field for keyboard input.
[1]: http://en.wikipedia.org/wiki/Point_of_sale
[2]: http://www.theloopyewe.com/
[3]: https://gist.github.com/dpb587/5229239#file-cardparser-js
[4]: http://en.wikipedia.org/wiki/Magnetic_stripe_card


@@ -1,169 +0,0 @@
---
title: New Website for The Loopy Ewe
layout: post
tags: [ 'elasticsearch', 'migration', 'redesign', 'theloopyewe' ]
description: A summary of the customer-facing changes I worked on for the site.
---
I've spent the past several months working on some website changes for [The Loopy Ewe][1]. On Thursday I was able to
push many of those frontend changes out. I thought I'd briefly discuss some of those changes here.
## Before and After
First off, it's fun to show before and after screenshots of many key areas...
### Home Page
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-homepage.jpg" width="308" />
<a href="http://theloopyewe.com/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-homepage.jpg" width="308" /></a>
So the home page is one of the first welcome pages to new visitors. I wanted to make sure it was warm and welcoming,
primarily through the central photos we show; the default one being the entry view of our shop (with a dynamic thumbnail
of our webcam in the bottom right). Over time we'll be able to rotate through different photos for different events,
product updates, and more clever things.
I wanted to get rid of the multi-color sidebar from every page so it could be better filled with more useful,
page-specific content. Visually, I increased the page width from 784px to 960px, so combined with dropping the sidebar
it allows for about 75% more content area.
Previously the sidebar was the main method of navigation, so I regrouped the old blue navigation link box into about 6
different topics to use as the main header links.
Instead of the simple, almost-nonexistent footer on the old site, I took advantage of that area to include store
information, social links, payment options, and numerous other credentials that customers can appreciate.
### Contact Us
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-contactus.jpg" width="308" />
<a href="http://www.theloopyewe.com/contact/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-contactus.jpg" width="308" /></a>
Contact information is important for customers. In addition to the information now being in the footer, there is a
cleaner page with a new interactive map to help people see exactly where the shop is located.
### Wonderful Customers
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-testimonials.jpg" width="308" />
<a href="http://www.theloopyewe.com/about/wonderful-customers/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-testimonials.jpg" width="308" /></a>
It's always nice to be able to show feedback customers send in. The new site reorganizes everything in a nicer, more
readable way, and on separate pages. It's also much simpler to submit a testimonial through the on-screen form.
### Shop
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop.jpg" width="308" /></a>
Generally speaking, I wanted the photos to be the main defining experience that a visitor has. To that end, product
photos became significantly larger in an effort to fill in the missing colors of the simple color palette I used.
Since it's the main shop page, I also included useful links like new products, gift certificates, search, and links for
browsing by some attributes.
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop-category.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/g/yarn/cascade/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-category.jpg" width="308" /></a>
Within specific shop categories, I only slightly increased the thumbnails and instead favored focusing more on the
different brands and their distinctions.
One other significant addition to the new website is the social sharing functionality. On most shop pages, there are new
social sharing links to Twitter, Pinterest, and Facebook. Using a custom short domain and campaign URL arguments, we can
get better insight into customer interests.
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop-brand.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/g/yarn/cascade/220/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-brand.jpg" width="308" /></a>
In my opinion, one of the best changes has been how products are viewed on pages like this. Using a sidebar to show the
description and attributes lets customers more quickly take in the larger, more enticing product photos together.
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop-product.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/p/F2FDB8A1-220-8905-Robin-Egg-Blue"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-product.jpg" width="308" /></a>
I think the second-best improvement is the individual product page, where the photo takes precedence and shows off the
quality of the product. A larger call-to-action makes it easier to add the item to carts and wishlists. I also
reorganized the product information to prioritize it better visually.
<p style="line-height:inherit;">
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop-search-grid.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/search/a/fiber-weight/fingering-weight/availability/in-stock/?q=red"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-search-grid.jpg" width="308" /></a>
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-shop-search-list.jpg" width="308" />
<a href="http://www.theloopyewe.com/shop/search/a/fiber-weight/fingering-weight/availability/in-stock/?q=red&amp;r%5Bview%5D=list-tn"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-search-list.jpg" width="308" /></a>
</p>
One major feature addition has been a real search engine. The old site used some complex and inefficient database
queries (which actually caused noticeable performance issues at rare times). With the new site, all the products are
properly indexed and searched via [elasticsearch][2]. I'm looking forward to adding more elasticsearch integrations on
the site in the future.
### Help
<img alt="Screenshot: before" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/old-help.jpg" width="308" />
<a href="http://www.theloopyewe.com/help/"><img alt="Screenshot: after" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-help.jpg" width="308" /></a>
Previously we had a single, text-heavy, difficult-to-read help page, also known as "frequently asked questions." The
new site breaks things down into different topics and adds creative pictures to make things more readable. There's also
a new inline form where customers can ask for help instead of having to open an email client and compose a message.
## New Stuff
Although I disabled a number of things to release and talk about later, it's always fun to include some completely new
functionality...
### Local
<a href="http://www.theloopyewe.com/local/classes.html"><img alt="Screenshot: web page" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-local.jpg" width="628" /></a>
I created a new topic dedicated to our local customers. Since it's not only an online store anymore, we wanted a way to
publicize some of the local activities that Fort Collins people would be interested in. It also lets online-only
customers see how we exist and work in real life to create more of a connection.
### About
<a href="http://www.theloopyewe.com/about/loopy-central/fort-collins.html"><img alt="Screenshot: web page" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-about.jpg" width="628" /></a>
Along with the local pages, I also wanted a better page showing our real-world presence so customers could feel more
connected and understand both who and where they're purchasing from.
### Shop Attributes
<a href="http://www.theloopyewe.com/shop/a/fiber-material/merino-wool/"><img alt="Screenshot: web page" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-shop-attribute.jpg" width="628" /></a>
In an effort to make navigating the shop easier, I created new pages to view products by attributes in a more organized
way. If somebody is interested in "Fingering Weight" they can easily see all the companies and brands that offer it. If
they need more complicated searches, there's an Advanced Search link at the bottom of each page.
### Site Feedback
<a href="http://www.theloopyewe.com/contact/site-feedback.html?uri=%2Fshop%2Fg%2Fyarn%2Fthe-loopy-ewe%2Floopy-cakes%2F"><img alt="Screenshot: web page" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-sitefeedback.jpg" width="628" /></a>
For both the cases of bugs and hearing ideas for improvement, I wanted to be sure visitors could easily send technical
feedback. Links at the footer of every page include information like what page they were looking at, what browser,
authenticated username information, and whatever notes they want to add.
### humans.txt
<a href="http://www.theloopyewe.com/humans.txt"><img alt="Screenshot: web page" src="{{ site.asset_prefix }}/blog/2013-04-27-new-website-for-the-loopy-ewe/new-humans.jpg" width="628" /></a>
Whenever possible, I like discussing and linking to technical resources that I have found useful. For the nerdy types, I
created the `humans.txt` file to document many of the resources that have helped make the website possible.
## Conclusion
So there's a basic overview of some of the less technical changes. I'm looking forward to rolling out several
additional features over the next few months to help keep things fresh. Later blog posts can discuss some of the more
technical processes and decisions that went into making the new site.
[1]: http://www.theloopyewe.com/
[2]: http://www.elasticsearch.org/


@@ -1,154 +0,0 @@
---
title: Embeddable and Context-Aware Web Pages
layout: post
tags: [ 'architecture', 'http', 'javascript', 'symfony', 'symfony2' ]
description: Embedding content in an absolutely relative manner.
---
In my [symfony][5] website applications I frequently make multiple subrequests to reuse content from other controllers.
For simple, non-dynamic content this is trivial, but when arguments can change the data, or when the browser may want
to update those subrequests, things start to get complicated. Usually it requires tying the logic of the subrequest
controller into the main request controller (e.g. knowing that the `q` argument needs to be passed to the template, and
then making sure the template passes it along in the subrequest). I wanted to simplify that and get rid of those inner
dependencies.
As an example, take a look at this [product search][1]. The [facets][2] and [results][3] are actually subrequests, but
the main results content is taking advantage of the request design I implemented. My goals were:
* remove logic from controller code to keep them independent from each other,
* pages work without JavaScript and without requiring newer browsers,
* pages work the same whether it's a subrequest or a master request, and
* any page should be capable of being a self-contained subrequest.
## Steps
When a subrequest is self-contained, I call it a *subcontext*. These subcontext requests have an additional requirement
of being publicly accessible. In the product search, the [results][3] page is publicly routed, and all the pagination
and view links will work properly within the `./results.html` page. This makes it easy to use XHR to load updated
content.
Another minor piece of this design is that views don't need to be fully rendered. This means an Ajax request can ask for
just the page content and exclude the typical header/footer. In [Twig][4] parlance it is a `frag_content` block which
has all the useful content.
When it comes to passing query parameters down through subcontexts, I decided that each subcontext gets its own scoped
variable. So whenever I render a subcontext in a template, I always specify a name for it. The name should be unique
within the template context. In the product search example, the facets subcontext is named `f` and the results
subcontext is named `r`. When a request arrives for `/?r[offset]=54`, the subrequest will arrive at the results
controller looking like `/results.html?offset=54` (which is equivalent to navigating that page directly).
To keep track of the subcontext names, template content, query data, and relative locations I started using a custom
request header named `tle-subcontext`. In practice it looks like:
    tle-subcontext: r:content@/shop/search/availability/in-stock/?q=red
When that request header exists it means:
* we're within a subcontext named `r`,
* we want to get the view fragment named `content`, and
* the root URL we started at was `/shop/search/availability/in-stock/?q=red`.
Within the controller code that header information should not be relevant. In templating though it becomes useful for
rewriting URLs. Whenever a template is going to give a link to itself, I wrap it in a custom `subcontext_rewrite`
function. For example, given the `tle-subcontext` configuration above, it would rewrite:
dataset_generic(...snip...)
=> /shop/.../in-stock/results.html?q=red&view=list-tn&offset=54
subcontext_rewrite(dataset_generic(...snip...))
=> /shop/.../in-stock/?q=red&r[view]=list-tn&r[offset]=54#r
The rewritten URL is completely valid and can be accessed without fancy JavaScript calls. Now, to make that possible I
don't use the standard inline renderer in Twig. I created a custom renderer with a little additional logic which takes
care of rewriting the subcontext data and injecting the header:
{% highlight php %}
$rootUri = $request->getRequestUri();
if (preg_match('/^([a-z0-9\-]+):([a-z0-9]+)@(.*)$/', $request->server->get('HTTP_TLE_SUBCONTEXT'), $match)) {
# this means a subcontext already exists and a sub-subcontext is being created
# append our context name to the parent context name
$options['name'] = $match[1] . '-' . $options['name'];
# use the root uri from the header since $request is only a subrequest
$rootUri = $match[3];
# pull out our context-specific query data from the root uri and update our request
parse_str(parse_url($match[3], PHP_URL_QUERY), $rootQuery);
$subRequest->query->replace(isset($rootQuery[$options['name']]) ? $rootQuery[$options['name']] : array());
} elseif ((null !== $subdata = $request->query->get($options['name'])) && (is_array($subdata))) {
# pull out our context-specific query data
$subRequest->query->replace($subdata);
}
# now add the header with all our combined data to the request
$subRequest->server->set(
'HTTP_TLE_SUBCONTEXT',
$options['name'] . ':' . (empty($options['frag']) ? 'content' : $options['frag']) . '@' . $rootUri
);
unset($options['name'], $options['frag']);
{% endhighlight %}
So now whenever I want a subcontext within a view, I can use the custom renderer:
{% highlight jinja %}{% raw %}
{{ render_subcontext(path('search_results', passthru), { 'name' : 'r' }) }}
{% endraw %}{% endhighlight %}
With those simple customizations I no longer have to worry about knowing what parameters need to be passed on to
template subrequests. It also paves the way for some more fancy behavior...
## Adding Some Magic
Since the subcontext pages are publicly accessible, it should be easy to let Ajax reload individual subcontexts without
having to reload the whole page. To enable that, I went ahead and configured subcontext requests to always end up in a
specific layout which wraps them with the subcontext metadata. The template looks like:
{% highlight jinja %}{% raw %}
<article id="{{ subcontext_name() }}" data-href="{{ app.request.uri }}#{{ subcontext_frag() }}">
<header><h3>{{ block('def_title') }}</h3></header>
<section>{{ block('frag_' ~ subcontext_frag()) }}</section>
</article>
{% endraw %}{% endhighlight %}
The `subcontext_*` custom functions simply peek at the request to find the `tle-subcontext` header and pull out the
appropriate values.
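As a rough sketch of what those helpers amount to (the real implementation lives in a Twig extension; the plain function names here are simplified for illustration), the header is parsed with the same pattern used by the custom renderer above:
{% highlight php %}
<?php

use Symfony\Component\HttpFoundation\Request;

// Parse the tle-subcontext header into its parts, falling back to sane
// defaults when the page is being served as a master request.
function parse_subcontext_header(Request $request)
{
    $header = $request->server->get('HTTP_TLE_SUBCONTEXT');

    if ($header && preg_match('/^([a-z0-9\-]+):([a-z0-9]+)@(.*)$/', $header, $match)) {
        return array('name' => $match[1], 'frag' => $match[2], 'root_uri' => $match[3]);
    }

    return array('name' => null, 'frag' => 'content', 'root_uri' => $request->getRequestUri());
}

// Equivalent of the subcontext_name() template function.
function subcontext_name(Request $request)
{
    $parsed = parse_subcontext_header($request);

    return $parsed['name'];
}

// Equivalent of the subcontext_frag() template function.
function subcontext_frag(Request $request)
{
    $parsed = parse_subcontext_header($request);

    return $parsed['frag'];
}
{% endhighlight %}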
Now that the extra data is available, we can have JavaScript build links to send partial requests. If a
`subcontext_rewrite` link is clicked, it's a matter of starting with the `article[@data-href]` value and peeking at the
clicked `a[@href]` to find the query parameters scoped to the `article[@id]` name. For example:
// window.location
/shop/search/availability/in-stock/?q=red
// subcontext
<article id="r" data-href="/shop/search/availability/in-stock/results.html?q=red#content">
// clicked anchor
<a href="/shop/search/availability/in-stock/?q=red&r[offset]=54#r">
// becomes the request
GET /shop/search/availability/in-stock/results.html?q=red&offset=54
tle-subcontext: r:content@/shop/search/availability/in-stock/?q=red
That is something easily processable with an Ajax request. And since the clicked anchor was a canonical URL, that easily becomes
the new window URL location by using the [HTML5 History API][6].
## Conclusion
Once I implemented the code snippets for tying all the ideas together, it became much quicker and simpler for me to
embed other dynamic controllers within my requests. So far it has been working out quite well and I no longer have to
worry about page-specific hacks for passing data to subrequests.
[1]: http://www.theloopyewe.com/shop/search/availability/in-stock/?q=red
[2]: http://www.theloopyewe.com/shop/search/availability/in-stock/facets.html?q=red
[3]: http://www.theloopyewe.com/shop/search/availability/in-stock/results.html?q=red
[4]: http://twig.sensiolabs.org/
[5]: http://symfony.com/
[6]: http://diveintohtml5.info/history.html

View File

@@ -1,231 +0,0 @@
---
title: Structured Data with schema.org
layout: post
tags: [ 'product', 'schema.org', 'structured data', 'xpath' ]
description: Ensuring content is useful to both humans and robots.
---
Good website content is important so people can learn and interact, but robots are the ones interpreting content to
figure out if the content is actually useful to people. With the new [website][1] I wanted to be sure I was using
standards and metadata so the content could be programmatically useful. I chose to use the markup from [schema.org][2]
due to its fairly comprehensive data types and broad adoption by search engines.
## Introduction
I think the importance of structured data is growing. Not only does it make things easier for search engines to
consistently interpret content, it can also help encourage properly designed website architecture. For example, if I
want search engines to know what the brand of a product is, it probably means I should ensure the product is linked to
the main brand page. A byproduct of this means a regular user can then click back to the main brand listings as well.
One of the most difficult things about embedding structured data is verifying that the markup looks how I expect. There
are tools on both [Google][5] and [Bing][8] for testing structured data, but they really work best for
publicly accessible pages (not development-local content). I found a [few][6] [other][7] [tools][9], but either they
were limited in their features or had some inconvenient bugs in how they represented data.
Ultimately, I wanted to see the website from a robot's perspective and make sure I could traverse it as one. To help
myself out with that, I created a tool which would parse arbitrary local pages into JSON data based on my understanding
of how robots would interpret data. For example, I could view the [home page][1] in [raw JSON][10], or I could pretend I
was a robot and browse it in a [formatted HTML][11] page where links are rewritten for follow-up.
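That tool isn't reproduced here, but the core idea is a small microdata walk over the DOM. A simplified sketch, assuming PHP's DOM extension and ignoring nested scopes, link properties, and `content`-attribute subtleties that a real parser has to handle:
{% highlight php %}
<?php

// Flatten the microdata items on a page into a JSON-friendly array.
function extract_microdata($html)
{
    $document = new DOMDocument();
    @$document->loadHTML($html);
    $xpath = new DOMXPath($document);

    $items = array();

    foreach ($xpath->query('//*[@itemscope and @itemtype]') as $scope) {
        $item = array('_type' => $scope->getAttribute('itemtype'));

        foreach ($xpath->query('.//*[@itemprop]', $scope) as $property) {
            $value = $property->hasAttribute('content')
                ? $property->getAttribute('content')
                : trim($property->textContent);

            $item[$property->getAttribute('itemprop')] = $value;
        }

        $items[] = $item;
    }

    return $items;
}

echo json_encode(extract_microdata(file_get_contents('https://www.theloopyewe.com/')));
{% endhighlight %}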
## Basic Pages
Even basic pages can provide some useful structured data. For example, the page describing the [Loopy Groupies][12]
doesn't have complicated content, but it still uses the basic [`WebPage`][14] type to identify breadcrumbs, titles, main
content, and a significant image on the page. By integrating the main site template, it also identifies the header and
footer as [`WPHeader`][16] and [`WPFooter`][16].
{% highlight json %}
{
"_type": "http://schema.org/WebPage",
"headline": "Loopy Groupies",
"name": "Loopy Groupies",
"breadcrumb": "Home \u00bb About \u00bb Loopy Groupies",
"mainContentOfPage": {
"_type": "http://schema.org/WebPageElement",
"mainContentOfPage": "[Photo: Little Loopy with some fun] So - about those Loopy Groupies - what IS that exactly? In addition to our Loopy Rewards program, with your sixth package, you become an official member of the Loopy Groupie Club. With that, you'll receive: a fun care package to welcome you in (a cool tote with a couple of goodies inside) Loopy Kisses with each order (you'll have to wait to see those in person) advance notice of all new yarns and products right when they go up on the website (although if any of you have a particular yarn line you're watching for, of course you can email us and request to get notice of that yarn early. We're always happy to do that. The only exceptions to that are for Wollmeise, simply because it sells out too quickly for the notice to get to you in time.) an extra appreciation gift a couple of times a year, when we find something fun that we want to include in your order that month! We hope to have YOU as a Loopy Groupie, soon!",
"primaryImageOfPage": {
"_type": "http://schema.org/ImageObject",
"contentUrl": "https://dy2k2bbze5kvv.cloudfront.net/static/9fdc4b5787/web/about/loopy-groupies.jpg"
}
}
}
{% endhighlight %}
Of course it's not limited to [`schema.org`][2] data types. The robot data also includes detailed breadcrumb data in the
[raw JSON][13] structure.
## Products
One of the most useful types in an e-commerce environment is [`SomeProducts`][3]. It lets robots see things like
pricing, inventory, availability, company, model, and various product attributes. For example, here's what our
[Slate Blue][4] product currently looks like to robots:
{% highlight json %}
{
"_type": "http://schema.org/SomeProducts",
"name": "57-61 Slate Blue",
"model": {
"_type": "http://schema.org/ProductModel",
"name": "Solid Series",
"url": "https://www.theloopyewe.com/shop/g/yarn/the-loopy-ewe/solid-series/"
},
"brand": {
"_type": "http://schema.org/Brand",
"name": "The Loopy Ewe",
"url": "https://www.theloopyewe.com/shop/c/the-loopy-ewe/"
},
"offers": {
"_type": "http://schema.org/Offer",
"priceCurrency": "USD",
"itemCondition": "http://schema.org/NewCondition",
"availability": "http://schema.org/InStock",
"price": "11.00"
},
"weight": {
"_type": "http://schema.org/QuantitativeValue",
"unitCode": "ONZ",
"value": "2.00"
},
"image": "https://ehlo-a0.theloopyewe.net/asset/catalog-entry-photo/e9a1b966-2747-11d5-a74b-d0ba4caf395e~v2-702x702.jpg",
"description": "Our Solid Series line brings you 90 solid colors in a smooshy fingering base, perfect for showing off your most intricate sock and shawl designs. You'll also love having this extensive palette of colors to choose from when working with colorwork, whether it's in socks, mitts, gloves, hats, cowls, shawls, or fine sweaters and vests. Dyed exclusively for The Loopy Ewe.",
"inventoryLevel": {
"_type": "http://schema.org/QuantitativeValue",
"unitCode": "SW",
"minValue": "15"
},
"_extra": [
{
"_type": "http://schema.org/QuantitativeValue",
"name": "Fiber Material",
"unitCode": "P1",
"value": "100",
"valueReference": {
"_type": "http://schema.org/QualitativeValue",
"url": "https://www.theloopyewe.com/shop/a/fiber-material/superwash-merino/",
"name": "Superwash Merino"
}
},
{
"_type": "http://schema.org/QualitativeValue",
"url": "https://www.theloopyewe.com/shop/a/fiber-weight/fingering-weight/",
"name": "Fingering Weight"
},
{
"_type": "http://schema.org/QuantitativeValue",
"name": "Yardage",
"unitCode": "YRD",
"value": "220"
},
{
"_type": "http://schema.org/ImageObject",
"contentUrl": "https://ehlo-a0.theloopyewe.net/asset/catalog-entry-photo/e9a1b966-2747-11d5-a74b-d0ba4caf395e~v2-702x702.jpg",
"thumbnailUrl": "https://ehlo-a1.theloopyewe.net/asset/catalog-entry-photo/e9a1b966-2747-11d5-a74b-d0ba4caf395e~v2-96x96.jpg"
},
{
"_type": "http://schema.org/ImageObject",
"contentUrl": "https://ehlo-a1.theloopyewe.net/asset/catalog-entry-photo/4a087c72-bdba-fc92-8867-3b7f1bd4fb24~v2-702x702.jpg",
"thumbnailUrl": "https://ehlo-a1.theloopyewe.net/asset/catalog-entry-photo/4a087c72-bdba-fc92-8867-3b7f1bd4fb24~v2-96x96.jpg"
}
]
}
{% endhighlight %}
With the markup on the page, it's now possible for search engines to quickly show information such as pricing and
availability alongside results for the product. Not only that, but given sufficient parsing it can also infer the
relationships that a specific page (marked as a product) has with other product concepts to create a more intelligent data
graph.
## Product Listings
For the main product types, pages also support listings that reference the individual products. The main
[Solid Series][17] listing has the following data:
{% highlight json %}
{
"_type": "http://schema.org/CollectionPage",
"mainContentOfPage": {
"_type": "http://schema.org/WebPageElement",
"_extra": [
{
"_type": "http://schema.org/ItemList",
"_extra": [
{
"_type": "http://schema.org/SomeProducts",
"url": "https://www.theloopyewe.com/shop/p/BEA04EDF-Solid-Series-00-Color-Cards",
"image": "https://ehlo-a0.theloopyewe.net/asset/catalog-entry-photo/1341fe35-3260-4ece-df42-387e9ddcafe5~v2-210x130.jpg",
"name": "00 Color Cards",
"itemCondition": "http://schema.org/NewCondition",
"offers": {
"_type": "http://schema.org/Offer",
"priceCurrency": "USD",
"availability": "http://schema.org/InStock",
"price": "15.00"
},
"inventoryLevel": {
"_type": "http://schema.org/QuantitativeValue",
"unitCode": "SW",
"minValue": "15"
}
},
{
"_type": "http://schema.org/SomeProducts",
"url": "https://www.theloopyewe.com/shop/p/0300A54D-Solid-Series-01-39-White",
"image": "https://ehlo-a0.theloopyewe.net/asset/catalog-entry-photo/cb0e1bb9-1431-ef41-7232-c79a3c510f2a~v2-210x130.jpg",
"name": "01-39 White",
"itemCondition": "http://schema.org/NewCondition",
"offers": {
"_type": "http://schema.org/Offer",
"priceCurrency": "USD",
"availability": "http://schema.org/InStock",
"price": "11.00"
},
"inventoryLevel": {
"_type": "http://schema.org/QuantitativeValue",
"unitCode": "SW",
"minValue": "15"
}
}
]
}
]
}
}
{% endhighlight %}
## Rationale
Nearly all pages on the new [website][1] have at least some structured data present, if only the breadcrumb data. All
this markup isn't simply an academic exercise though. For example, [Ravelry][18] supports checking the pricing and
inventory of our product ads and displaying them to users. Instead of complex, fragile regular expressions or DOM
traversal, we can just point them at an XPath query like `*[@itemscope and @itemtype = "http://schema.org/SomeProducts"]`.
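A hedged sketch of that consumer side, anchoring the query with `//` for a document-wide search and assuming the price and availability are exposed as `itemprop` attributes inside the offer markup (as the JSON above suggests):
{% highlight php %}
<?php

// Check price and availability on a product page the way an ad consumer might.
$document = new DOMDocument();
@$document->loadHTML(file_get_contents('https://www.theloopyewe.com/shop/p/C7CBF721-Solid-Series-57-61-Slate-Blue'));
$xpath = new DOMXPath($document);

foreach ($xpath->query('//*[@itemscope and @itemtype = "http://schema.org/SomeProducts"]') as $product) {
    $price = $xpath->query('.//*[@itemprop = "price"]', $product)->item(0);
    $availability = $xpath->query('.//*[@itemprop = "availability"]', $product)->item(0);

    printf(
        "price=%s availability=%s\n",
        $price ? ($price->getAttribute('content') ?: trim($price->textContent)) : 'n/a',
        $availability ? ($availability->getAttribute('href') ?: trim($availability->textContent)) : 'n/a'
    );
}
{% endhighlight %}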
One of the original motivations behind focusing on structured data was the goal of having an internal search for the
site. Instead of writing web page scrapers that know what the DOM looks like and how to find significant content, it has
been much easier to rely on simple `schema.org` types which are consistent across all pages. The structured data on
pages is still a work in progress as I learn more about what robots are interested in and figure out the best way to
represent content.
[1]: https://theloopyewe.com/
[2]: http://schema.org/
[3]: http://schema.org/SomeProducts
[4]: https://theloopyewe.com/shop/p/C7CBF721-Solid-Series-57-61-Slate-Blue
[5]: http://www.google.com/webmasters/tools/richsnippets
[6]: http://linter.structured-data.org/
[7]: http://foolip.org/microdatajs/live/
[8]: http://www.bing.com/toolbox/markup-validator
[9]: https://github.com/linclark/MicrodataPHP
[10]: https://www.theloopyewe.com/api/search/resource/uri.json?uri=%2F&pretty
[11]: https://www.theloopyewe.com/api/search/resource/uri.html?uri=%2F
[12]: https://www.theloopyewe.com/about/loopy-groupies.html
[13]: https://www.theloopyewe.com/api/search/resource/uri.json?uri=%2Fabout%2Floopy-groupies.html&pretty
[14]: http://schema.org/WebPage
[15]: http://schema.org/WPHeader
[16]: http://schema.org/WPFooter
[17]: https://www.theloopyewe.com/shop/g/yarn/the-loopy-ewe/solid-series/
[18]: http://www.ravelry.com/

View File

@@ -1,78 +0,0 @@
---
title: "ti-debug: For Debugging Server Code in the Browser"
layout: post
tags: [ 'debugger', 'node', 'php', 'xdebug', 'webkit' ]
description: Making it easier to debug languages like PHP and Python with only a browser.
---
I find that I am rarely using full IDEs to write code (e.g. [Eclipse][1], [Komodo][6], [NetBeans][3], [Zend Studio][2]).
They tend to be a bit sluggish when working with larger projects, so I favor simplistic editors like [Coda][4] or the
always-faithful [vim][5]. One thing I miss about using full-featured IDEs is their debugging capabilities. They usually
have convenient debugger interfaces that allow stepping through runtime code to investigate bugs.
About a year ago I started a project called [ti-debug][7] with the goal of being able to debug my server-side code (like
[PHP][8]) through [WebKit][13]'s developer tools interface. After getting a functional prototype of it working, I got
distracted with other projects and it dropped lower in the list of my repository activities. That is, until a few weeks
ago when [David][9] from [CityIndex][10] expressed interest in the project. I've been able to spend some sponsored time
in order to finish some of the features, update dependencies, and create a more stable project.
## Functionality
If you're familiar with the WebKit developer tools (also found in [Google Chrome][11]), the interface should look
extremely familiar. The core of `ti-debug` is written in [node.js][12] and when started up, it creates a simple web
server for you to open a browser tab and connect to. While you develop in other tabs, it will wait until there is an
incoming debug session at which point it loads up the debug environment and waits for you to step through code.
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/waiting-to-debug.jpg"><img alt="Screenshot: waiting for connection" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/waiting-to-debug.jpg" width="308" /></a>
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/initial-pause.jpg"><img alt="Screenshot: waiting for interaction" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/initial-pause.jpg" width="308" /></a>
The full stack trace is available along with all the local and global variables. In addition to the basic step
over/into/out, breakpoints can be set throughout the code. When paused, variables can be inspected and explored. In
addition to simple types like strings and booleans, complex objects and arrays can be expanded and further explored.
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/breakpoints.jpg"><img alt="Screenshot: breakpoint exploration" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/breakpoints.jpg" width="628" /></a>
Not only can variables be read, they can also be updated inline by double clicking and entering new values. Or, for more
advanced commands, the console can be used to evaluate application code, possibly updating the runtime.
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/propset-inline.jpg"><img alt="Screenshot: waiting for connection" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/propset-inline.jpg" width="308" /></a>
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/propset-console.jpg"><img alt="Screenshot: waiting for interaction" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/propset-console.jpg" width="308" /></a>
Like most other IDE debuggers, the frontend supports jumping through the various levels in the stack to inspect the
runtime and run arbitrary commands. One other minor feature is watch expressions, which are evaluated during every pause.
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/stack-jumping.jpg"><img alt="Screenshot: waiting for connection" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/stack-jumping.jpg" width="308" /></a>
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/watch-expressions.jpg"><img alt="Screenshot: waiting for interaction" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/watch-expressions.jpg" width="308" /></a>
Once a debug session has completed, the debug tab gets redirected back to the waiting page. Or, if the debug tab gets
closed in the middle of the debug session, the debugger will detach from the program and let it run to completion.
PHP isn't the only supported language. By using the debugging modules from [Komodo][14], other languages that speak the
DBGp protocol can also use `ti-debug`. For example, Python scripts can currently be debugged, too...
<a href="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/python.jpg"><img alt="Screenshot: breakpoint exploration" src="{{ site.asset_prefix }}/blog/2013-05-16-ti-debug-for-debugging-server-code-in-the-browser/python.jpg" width="628" /></a>
## Workflow
`ti-debug` can be run locally for a single developer, but in the case of DBGp it can also act as a proxy to support
multiple developers, or a mix of developers wanting to use the browser-based debugger alongside their own local IDEs.
This way, `ti-debug` can run on a central development server to give all developers access.
[1]: http://www.eclipse.org/
[2]: http://www.zend.com/products/studio/
[3]: https://netbeans.org/
[4]: http://panic.com/coda/
[5]: http://www.vim.org/
[6]: http://www.activestate.com/komodo-ide
[7]: https://github.com/dpb587/ti-debug
[8]: http://php.net/
[9]: https://github.com/mrdavidlaing
[10]: https://github.com/cityindex
[11]: https://www.google.com/intl/en/chrome/browser/
[12]: http://nodejs.org/
[13]: http://www.webkit.org/
[14]: http://code.activestate.com/komodo/remotedebugging/

View File

@@ -1,315 +0,0 @@
---
title: The Basics of a Custom Search Engine
layout: post
tags: [ 'elasticsearch', 'gearmand', 'schema.org', 'search', 'sitemap', 'structured data' ]
description: Combining elasticsearch and "structured data" to create a self-hosted search engine.
---
One of the most useful features of a website is the ability to search. [The Loopy Ewe][4] has had some form of faceted
product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog
posts and the like. [Google][1] seems to lead in offering custom search products with both [Custom Search Engine][2] and
[Site Search][3], but they're either branded or cost a bit of money. Instead of investing in their proprietary products,
I wanted to try to create a simple search engine for our needs which took advantage of my previous work in implementing
existing open standards.
## Introduction
In my mind, there are four basic processes when creating a search engine:
**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up
a [sitemap][6] for the site. Internally, the feature bundles of the site are responsible for generating their own
sitemap (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`][10] just
advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.
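As a sketch of what that review amounts to (the real step also has to handle sitemap index files and gzipped sitemaps, which this skips):
{% highlight php %}
<?php

// Pull candidate URLs for indexing out of a standard sitemap.xml.
function discover_sitemap_urls($sitemapUrl)
{
    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $sitemap = simplexml_load_file($sitemapUrl);

    $urls = array();

    foreach ($sitemap->children($ns)->url as $entry) {
        $urls[] = (string) $entry->children($ns)->loc;
    }

    return $urls;
}

foreach (discover_sitemap_urls('http://www.theloopyewe.com/sitemap.xml') as $url) {
    echo $url, "\n";
}
{% endhighlight %}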
**Parsing** - understanding the documents to know what content is significant. Given my previous work of [implementing
structured data][7] on the site and creating internal tools for reviewing the results, parsing becomes a very simple
task.
The next two processes are more what I want to focus on here:
* **Indexing** - ensuring the documents are accessible via search queries.
* **Maintenance** - keeping the documents updated when they are updated or removed.
## Indexing
We were already using [elasticsearch][8], so I was hoping to use it for full-text searching as well. I decided to
maintain two types in the search index.
### Discovered Documents (`resource`)
The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it
directly, it's more of a basic key-value store keyed on the URL. The mapping looks something like:
{% highlight javascript %}
{ "_id" : {
"type" : "string" },
"url" : {
"type" : "string",
"index" : "no" },
"response_status" : {
"type" : "string",
"index" : "no" },
"response_headers" : {
"properties" : {
"key" : {
"type" : "string",
"index" : "no" },
"value" : {
"type" : "string",
"index" : "no" } } },
"response_content" : {
"type" : "string",
"index" : "no" },
"date_retrieved" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" },
"date_expires" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it
creates a new record and queues a task to download the document. The initial record looks like:
{% highlight javascript %}
{
"_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
"url" : "/local/fling/spring-fling-2013/",
"date_expires" : "2001-01-01 00:00:00"
}
{% endhighlight %}
Then the download task is responsible for:
1. Receiving a URL to download;
2. Finding the current `resource` record;
3. Validating it against `robots.txt`;
4. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
5. Updating the `resource` record with the response and new `date_*` values;
6. And, if the document has changed, queueing a task to parse the `resource`.
By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the
future. The field is used to find stale documents later on.
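A stripped-down sketch of steps 4 and 5 of that task, assuming cURL, treating `response_headers` as a simple key/value map for brevity, and leaving out the `robots.txt` check and the queueing pieces:
{% highlight php %}
<?php

// Refresh a resource record with a conditional GET so unchanged documents
// come back as cheap 304 responses.
function refresh_resource(array $resource)
{
    $headers = array();

    if (!empty($resource['response_headers']['etag'])) {
        $headers[] = 'If-None-Match: ' . $resource['response_headers']['etag'];
    }

    if (!empty($resource['response_headers']['last-modified'])) {
        $headers[] = 'If-Modified-Since: ' . $resource['response_headers']['last-modified'];
    }

    $curl = curl_init($resource['url']);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);

    $body = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);

    $resource['date_retrieved'] = date('Y-m-d H:i:s');
    // Fall back to a few days out when the server doesn't send Expires.
    $resource['date_expires'] = date('Y-m-d H:i:s', strtotime('+3 days'));

    if (304 != $status) {
        $resource['response_status'] = $status;
        $resource['response_content'] = $body;
        // ...the parse task would get queued here since the document changed...
    }

    return $resource;
}
{% endhighlight %}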
### Parsed Documents (`result`)
The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some
structured fields which are generated by the parsing step. The mapping looks like:
{% highlight javascript %}
{ "_id": {
"type": "string" },
"url": {
"type": "string",
"index": "no" },
"itemtype": {
"type": "string",
"analyzer": "keyword" },
"image": {
"type": "string",
"index": "no" },
"title": {
"boost": 5.0,
"type": "string",
"include_in_all": true,
"position_offset_gap": 64,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"keywords": {
"_boost": 6.0,
"type": "string",
"include_in_all": true,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"description": {
"_boost": 3.0,
"type": "string",
"analyzer": "standard" },
"crumbs": {
"boost": 0.5,
"properties": {
"url": {
"type": "string",
"index": "no" },
"title": {
"type": "string",
"include_in_all": true,
"analyzer": "standard" } } },
"content": {
"type": "string",
"include_in_all": true,
"position_offset_gap": 128,
"analyzer": "standard" },
"facts": {
"type": "object",
"enabled": false,
"index": "no" },
"date_parsed" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" }
"date_published" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
A few notes on the specific fields:
* `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
* `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
* `title` - usually based on the `title` tag or more-concise `og:title` data
* `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted
phrases)
* `description` - usually the description `meta` tag
* `content` - any remaining useful, searchable content that somebody might want to search through
* `facts` - arbitrary data used for rendering more helpful search results; some common keys:
* `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
* `product_model` - indicate a product model name for the result
* `brand` - indicate the brand name for the result
* `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
* `availability` - for a product this is usually "in stock" or "out of stock"
* `date_published` - for content such as blog posts or announcements
The `result` type is updated by the parse task which is responsible for:
1. Receiving a URL to parse;
2. Finding the current `resource` record;
3. Running the `response_content` through the appropriate structured data parser;
4. Extracting generic data (e.g. title, keywords);
5. Extracting `itemtype`-specific metadata, usually for `facts`;
6. And updating the `result` record.
For example, this parsed [product model][17] looks like:
{% highlight javascript %}
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
"itemtype" : "ProductModel",
"title" : "Madelinetosh Tosh DK",
"keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
"image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
"crumbs": [
{
"url" : "/shop/",
"title" : "Shop" },
{
"url" : "/shop/g/yarn/",
"title" : "Yarn" },
{
"url" : "/shop/g/yarn/madelinetosh/",
"title" : "Madelinetosh" } ],
"content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
"facts" : {
"collection": [
{
"value" : 93,
"label" : "products" } ],
"brand" : "Madelinetosh",
"price" : "22.00" },
"_boost" : 4 }
{% endhighlight %}
### Searching
Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`][11] library:
{% highlight php %}
<?php
$bool = (new \Elastica\Query\Bool())
->addMust(
(new \Elastica\Query\Bool())
->setParam('minimum_number_should_match', 1)
->addShould(
(new \Elastica\Query\QueryString())
->setParam('default_field', 'keywords')
/* ...snip... */ )
->addShould(
(new \Elastica\Query\QueryString())
->setParam('default_field', 'title')
/* ...snip... */ )
->addShould(
(new \Elastica\Query\QueryString())
->setParam('default_field', 'content')
/* ...snip... */ ) );
/* ...snip... */
$query = new \Elastica\Query($bool);
{% endhighlight %}
To easily focus specific matches in the `title` and `content` fields I can enable highlighting:
{% highlight php %}
<?php
$query->setHighlight(
array(
'pre_tags' => array('<strong>'),
'post_tags' => array('</strong>'),
'fields' => array(
'title' => array(
'fragment_size' => 256,
'number_of_fragments' => 1 ),
'content' => array(
'fragment_size' => 64,
'number_of_fragments' => 3 ) ) ) );
{% endhighlight %}
## Maintenance
A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I
take two approaches:
**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a
process can go through and identify documents which have not been updated recently. If it sees something is stale, it
goes ahead and queues it for update.
**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search
results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things
like inventory changes or product changes in an order.
In either case, when a URL is discovered to be gone, the records from both `resource` and `result` are removed for the
URL.
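A sketch of how the stale-document sweep might query for expired resources with the same Elastica library, passing a raw range query rather than relying on specific query-builder classes (the index and type names are assumed from the descriptions above):
{% highlight php %}
<?php

// Find resource records whose date_expires has passed so they can be queued
// for a fresh download.
$client = new \Elastica\Client();
$index = $client->getIndex('search'); // index name assumed for illustration

$query = new \Elastica\Query(array(
    'query' => array(
        'range' => array(
            'date_expires' => array(
                'lte' => date('Y-m-d H:i:s'),
            ),
        ),
    ),
));

$resultSet = $index->getType('resource')->search($query);

foreach ($resultSet->getResults() as $result) {
    $data = $result->getData();
    // ...queue a download task for $data['url']...
}
{% endhighlight %}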
### Utilities
Sometimes there are deploys where specific pages are definitely changing, or when a whole new sitemap is getting
registered with new URLs. Instead of waiting for the time-based updates or cron jobs to run, I have these commands
available for scripting:
* `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
* `search:index-update` - find all the expired resources and queue them for update
* `search:result-rerun` - force the download and parsing of a URL
* `search:sitemap-generate` - regenerate all registered sitemaps
## Conclusion
Starting with structured data and elasticsearch makes building a search engine significantly easier. Structured data and
indexing make it faster to show smarter [search results][16]. Existing standards like [OpenSearch][12] make it easy to extend
the search from a web page into the [browser][15] and even third-party applications via [Atom][13] and [RSS][14] feeds.
Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking
algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the
perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.
[1]: http://www.google.com/
[2]: http://www.google.com/cse/all
[3]: http://www.google.com/enterprise/search/products_gss_pricing.html
[4]: http://www.theloopyewe.com/
[5]: http://schema.org/
[6]: http://www.sitemaps.org/
[7]: /blog/2013/05/13/structured-data-with-schema-org.html
[8]: http://www.elasticsearch.org/
[10]: http://www.theloopyewe.com/sitemap.xml
[11]: https://github.com/ruflin/Elastica/
[12]: http://www.opensearch.org/Home
[13]: https://www.theloopyewe.com/search/results.atom?q=spring+fling
[14]: https://www.theloopyewe.com/search/results.rss?q=spring+fling
[15]: https://www.theloopyewe.com/search/opensearch.xml
[16]: https://www.theloopyewe.com/search/?q=madelinetosh
[17]: https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/

View File

@@ -1,179 +0,0 @@
---
title: "Barcoding Inventory with QR Codes"
layout: post
tags: [ 'barcode', 'qr', 'retail', 'product', 'label', 'scan' ]
description: A web-centric, user-friendly approach for using barcodes in a retail shop.
---
Most decently-sized stores will have barcodes on their products. For the store, it makes the checkout process extremely
easy and accurate. For the consumer, barcodes might be useful with a phone app to scan them. I needed to make the
inventory scannable at the [shop][1], and I really wanted to do it in a more meaningful way than 1D barcodes could
support.
## Barcodes: 1D vs 2D
There are two different kinds of barcodes: 1-dimensional (1D) and 2-dimensional (2D). 1D barcodes allow for a purely linear scan of
simple, [UPC][2]-like codes. While 1D barcodes are extremely commonplace on many products, I dislike them because
they can't provide any context.
For example, if I were shopping in [Target][3] and scanned a UPC barcode with a regular phone app, it might take me to
the [Amazon][4] listing first - not necessarily great for Target's business, but it also becomes a completely separate
brand channel distracting my thoughts. Another example is when UPCs aren't registered on a product - different retail
stores will make up their own internal barcode which isn't helpful at all if I try to scan it.
On the other hand, 2D barcodes require more complex parsing but they can hold much more data. [QR codes][5] are one extremely
common form of 2D barcodes and they typically encode URLs. With my goal of providing more context, URLs provide just
that - not only with a domain name, but an arbitrary path. If somebody scanned an item at our shop, they'd at least get
redirected through the shop's website.
One disadvantage that QR codes have compared to 1D barcodes is their size and resolution requirements. All 1D barcodes
could theoretically be 1 pixel high, but QR codes must be square. To help ensure reasonably sized QR codes, most people
will use a URL shortener service - shorter URLs mean simpler QR designs, and simpler designs mean the QR code can be read
more easily and doesn't need to be large.
Another disadvantage of QR codes is that 2D handheld scanners are significantly more expensive than 1D ones. Fortunately,
many previously-used 2D scanners can be found on [eBay][6] for very reasonable prices. Unfortunately, I found that
some of the used ones would turn unreliable after a period of time.
## Mapping URLs to retail "things"
While inventory was the primary target of barcoding, I really wanted to barcode most things involved with retail
workflows (like order receipts). With that in mind I figured I needed to store three properties:
* `insignia` - the unique, short identifier (e.g. `EyV3chYax`)
* `target_ref` - the type of "thing" (e.g. `inventory` or `order`)
* `target_id` - the ID of the "thing" (e.g. `010035EA-9F6D-41A2-97C4-EEB5A3F3034A`)
I created a manager which supports three basic operations (internally it uses a map of the different types of
"things"):
* `getInsignia($target)` - which returns the short identifier/insignia
* `getTarget($insignia)` - which returns the application object
* `getResponse($insignia)` - which returns an appropriate HTTP response
I created a couple of HTTP endpoints which utilize the manager (see the sketch after this list):
* `/io/{insignia}` - which returns the result of `getResponse` (typically a redirect)
* `/io/{insignia}.png` - which returns the QR code image
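A rough sketch of how thin those endpoints can stay by delegating to the manager; the interface and class names here are invented for illustration rather than taken from the actual codebase:
{% highlight php %}
<?php

use Symfony\Component\HttpFoundation\Response;

// Hypothetical manager contract mirroring the three operations above.
interface InsigniaManagerInterface
{
    public function getInsignia($target);
    public function getTarget($insignia);
    public function getResponse($insignia);
}

class IoController
{
    private $manager;

    public function __construct(InsigniaManagerInterface $manager)
    {
        $this->manager = $manager;
    }

    // GET /io/{insignia} - typically a role-aware redirect to the "thing".
    public function lookupAction($insignia)
    {
        return $this->manager->getResponse($insignia);
    }

    // GET /io/{insignia}.png - render the QR code pointing at the short URL.
    public function imageAction($insignia)
    {
        $png = $this->renderQrPng('http://tle.io/' . $insignia);

        return new Response($png, 200, array('Content-Type' => 'image/png'));
    }

    private function renderQrPng($url)
    {
        // ...delegate to whichever QR image library is in use and return PNG bytes...
    }
}
{% endhighlight %}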
Then, whenever I want to print a QR code on a document, I just have to do:
<img src="{% raw %}{{ web_insignia_png(transaction) }}{% endraw %}" style="float:right;" />
Your Receipt for Order #{% raw %}{{ transaction.id }}{% endraw %}
Further, the QR code can be used with a redirecting short domain for even simpler codes:
> ![QR Code](http://www.theloopyewe.com/io/EyV3chYax.png?s=2 "http://tle.io/EyV3chYax")
> [`http://tle.io/EyV3chYax`](http://www.theloopyewe.com/io/EyV3chYax)
## Adding More Context
One of the reasons I wanted to use QR codes was context. Aside from scans now landing on the shop's website, they can
be even more context-aware through security roles. For example, if a customer scans the QR code above, they'll end up
on the product page for the shop as you would expect; but if an admin scans it they'll end up on the main inventory
page to see current quantities and recent transactions.
Or, a better example is with order receipts. If somebody scans their order receipt, they'll be required to login and
then will be taken to their order details page, assuming the order was on their account. If an admin scans the receipt
(perhaps while packing it) they'll be taken to the administrative, detailed view of the order.
The idea of context doesn't only apply to where a user might end up, it also applies to how they get there. For
example, sometimes there will be more than one bolt of a single fabric pattern, and, since each bolt is a different
"thing" in the system, they each have a different QR code. If a customer scans either of the bolts, they would get
taken to the exact same public product page. However, when an admin scans a bolt they'll get taken to the detailed
view showing which orders were cut on that specific bolt and how much yardage the bolt still has.
## Integrated Context
At this point, the barcodes were extremely accessible for one-off scans, but I also wanted to integrate the barcodes
into specific points of the system. For the computers we're using USB 2D barcode scanners which are capable of acting
like a keyboard device (the computer sees it "typing" whatever it scans, followed by an Enter). The most useful
integration point was the POS for handling in-store shoppers.
For the POS, I created a new UI component which auto-focused itself. Once something gets scanned, it sends the scanned
data to the server so it can figure out what should happen. For QR code scans, it performs the insignia lookup to find
the actual inventory item. Then, for simple inventory items it can just add the scanned item to the order. For fabric
on the bolt, it comes back with a dialog about how much to cut. For complex items, it shows a dialog for further
specifications. Or there might just be a discrepancy and it needs to come back and show a message. Once the item is
added it provides visual feedback, the scan field is re-focused and the cycle continues. It works something like the
following in a browser...
<blockquote>
<div style="color:#666666;padding-left:5px;">
<span id="demoscan-dotty" style="background-color:#CC0000;border:#999999 solid 1px;border-radius:3px;display:inline-block;margin-bottom:-1px;width:12px;height:12px;"></span>
<a id="demoscan-talkr" class="subtle" href="#" style="display:inline-block;padding:3px 2px;">Click to Scan</a>
<input id="demoscan-input" type="text" style="border:transparent;background-color:transparent;height:1px;margin:0;padding:0;width:1px;" />
</div>
<script src="//ajax.googleapis.com/ajax/libs/mootools/1.4.5/mootools-yui-compressed.js"></script>
<script type="text/javascript">
var talkr = $('demoscan-talkr');
var input = $('demoscan-input');
var dotty = $('demoscan-dotty');
dotty.set('tween', { link : 'cancel', duration : 200 });
input
.addEvent(
'focus',
function () {
talkr.set('text', 'Ready to Scan');
dotty.tween('background-color', '#66FF66');
}
)
.addEvent(
'blur',
function () {
talkr.set('text', 'Click to Scan');
dotty.tween('background-color', '#CC0000');
}
).addEvent(
'keydown',
function (e) {
if ('enter' != e.key) {
return;
} else if (!this.value) {
return;
}
prompt('Seems like you scanned...', this.value);
this.value = '';
this.focus();
}
)
;
talkr
.addEvent(
'click',
function (e) {
input.focus();
e.preventDefault();
}
)
;
</script>
</blockquote>
## Conclusion
I feel like the shop is able to better grow both technically and logistically by having used QR codes as opposed to a
classic barcode system. A few techy customers have tried the QR codes, but it's not really something we've been
promoting. Once the website has a proper mobile-friendly version we'll have a better opportunity and reason to try and
impress customers with the QR codes. In the meantime, the QR codes have been an immense time-saver for both staff and
shoppers checking out at the shop.
[1]: http://www.theloopyewe.com/
[2]: http://en.wikipedia.org/wiki/Universal_Product_Code
[3]: http://www.target.com/
[4]: http://www.amazon.com/
[5]: http://en.wikipedia.org/wiki/QR_code
[6]: http://www.ebay.com/

View File

@@ -1,240 +0,0 @@
---
title: Distributed Docker Containers
layout: post
tags: [ 'aws-ec2', 'docker', 'nodejs', 'scs-utils' ]
description: A strategy for integrating Docker services across multiple hosts and data centers.
---
One thing I've been working with lately is [Docker][1]. You've probably seen it referenced in various tech articles
lately as the next greatest thing for cloud computing. Docker runs "containers" from base "images" which essentially
allow running many lightweight virtual machines on any recent, Linux-based system. Internally, the magic behind it is
[lxc][2], although Docker adds a lot more on top to make it more usable.
For a long time now I've used virtual machines for development - it allows me to better simulate how software runs out
on production servers. Historically, [Vagrant][3] + [VirtualBox][4]/[VMWare Fusion][5]/[EC2][6] have been great tools
for that, but they have limitations and they tend to drift a bit from production architecture.
## The Problem
In trying to duplicate the production environments, it's not typically feasible for me to run more than one virtual
machine on my laptop. I could split my single local virtual machine into multiple EC2 instances; but then it becomes more
difficult to manage IP addresses for the various service dependencies as the instances get stopped/started between
working sessions (in addition to the extra costs). VPCs with private IP addresses do help with that a lot, as long as
there's a sane way to manage those resources.
Another issue that comes up when combining services on a single host is dependency overlap. One example of this is
shared modules. Some newer features of nginx require a newer version of the openssl libraries. However, PHP doesn't
necessarily support the newer version of openssl without upgrading quite a few other components. While there may be
workarounds, the inconvenience of it all typically just prompts me to avoid working on that particular feature,
unfortunately.
Ultimately, I want to have the same software and network stack that I use in a production environment, but in a
development environment and, if possible, locally on my laptop.
## The Alternatives
This problem is certainly not unique, but a practical solution has been difficult for me to find. I've been
experimenting with a few different technologies over the years trying to solve this sort of thing.
Vagrant is obviously the first practical solution. For me, it has been a functional solution for quite a while, but not
an optimal one. Like I mentioned before, it's a bit bulky when attempting to mimic non-trivial architectures on a
standard laptop. For a while now, I've been trying to find the motivation and time to migrate to a better setup.
With the advent of Docker, many of my software requirements become much simpler. Each piece of software can run in its
own container and I don't have to worry about dependency overlap. Multiple containers are *significantly* cheaper than
trying to run multiple virtual machines. I could even reuse containers built on my development machine out on
production. One thing Docker doesn't effectively solve is service dependencies. It can support them on a single host with
links, but not across multiple hosts.
I've been keeping an eye out for other tools which may help solve these problems. Some of them are:
* [decking][7] - seems to primarily build on top of Docker's built-in link functionality for service dependency within
a single host
* [etcd][15] - an excellent distributed, hierarchical key-value store; very useful for monitoring configuration values
and being notified when they change (related: [confd][22])
* [fig][8] - seems like [Foreman][21], but geared for Docker containers
* [flynn][11] - originally I was very excited about this, however it still seems underdeveloped for the purposes of
service discovery of arbitrary services; I'm still very hopeful
* [serf][9] - a very new client for distributing data across a cluster and taking action on it. To me it seems like
more of a management tool (like half of the [mcollective][10] utility)
Recently, I've been becoming more acquainted with [bosh][12], an interesting tool for managing large deployments along
with all their dependencies. To me, bosh always seems overly complicated for whatever I'd want to accomplish and has
quite a few bosh-specific practices to learn. Its resource and service management is very thorough, although it takes a
while to get comfortable with it. It seems more like an infrastructure management tool rather than a service management
tool, and I was hoping to keep those responsibilities separate and simpler. Ultimately, I think bosh could be made to
work... but I was still hoping for something different, lighter, and utilizing more common open source tools that I was
already familiar with.
## The Ideas
I had a simple application in mind to roughly define my "[minimum viable product][13]":
0. run a WordPress web application, a MySQL server, and a backup MySQL server as separate services
0. runtime parity (between development and production)
1. configure services the exact same way
1. run services the exact same way
1. depend on other services the exact same way
0. architecture flexibility
1. in production, run the services on three separate hosts across two separate data centers
1. in development, run all services on a single virtual machine on my laptop
0. service flexibility - be able to dynamically relocate services without manual reconfiguration and minimal downtime
* combine services into one or two hosts during quiet hours
* move a service to a more powerful instance during high load
0. self-provisioning - when a container requires a particular volume or network, make sure it can be automatically
provisioned and de-provisioned
First off, I knew I wanted to run the services inside of Docker containers. I can only imagine Docker's ubiquity will
continue to grow, and the ability to run completely arbitrary software anywhere with minimal host dependencies seemed
like a perfect, lightweight solution.
I've used [Puppet][14] to configure servers and applications for a long time. While I dislike the overhead it requires
for smaller use cases, I really like the consistency and declarative nature that it provides. Since I'll continue to use
it for host server configuration, it's a small stretch to also use it for configuring the service runtimes.
When it comes down to it, I think there are two main questions that a service must answer:
* How should I work? and
* How do I connect with the rest of the world?
The first question can be managed and configured via Puppet. Once a service is configured and compiled to run as
requested, it never needs to go through that process again. This approach lets compiled Docker images be consistently
reused across time and servers.
The second question deals with pointing WordPress to the MySQL server, or pointing the MySQL server to the data directory,
or running the MySQL backup server on a specific network segment. These decisions and connections have nothing to do
with how the service should work, so they can be changed as needed. So far, I have four main dependencies about how
these containers get connected:
0. volumes - giving containers a place to write persistent data (e.g. WordPress `wp-content/uploads` directory)
0. provided services - a service that the container is running (e.g. `http` on `80/tcp`)
0. required services - a service that the container needs (e.g. `mysql`)
0. network - how the container is attached to the network
I think these basic aspects effectively describe everything needed to manage a self-contained service.
## The Implementation
The next step of an idea is to prototype it, and that's where I am today. There are several pieces that I've been
working on, but three general topics...
## Service Discovery
One of the most interesting concepts is service discovery. I wanted containers to be able to connect with each other
across multiple hosts and data centers. I've been using DNS for host discovery and, while it works great it doesn't seem
entirely appropriate for "containerized" discovery. Through [`A`][23] records, DNS easily picks up on hosts changing,
but is not so good for dynamic ports. DNS [`SRV`][24] records seem *much* more appropriate with attributes for both
hostname and port, but `SRV` records are rarely used in internal APIs.
Originally I was using etcd to register and discover services, but I found it to be inefficient for filtering services
and propagating changes. Instead, I created a specialized client/server protocol to handle the registration and
discovery process. In technical terms, the protocol works like the following...
WordPress needs a database, so before it starts the container, it connects with the disco server:
> **container**: Hi, I need a `mysql` service to talk to - who's available?
> **disco**: You should talk with `192.0.2.11:39313` - I'll keep you posted if it changes, but let me know if you no
> longer need it
The results are injected as environment variables when the container is started, and the container can use them however it likes.
WordPress obviously runs a web server, so, once the container is started, the container manager connects with disco:
> **container**: Hi, I'm `wordpress` and I have an `http` service available at `192.0.2.12` on port `39212`
> **disco**: Nice to meet you; let me know if you no longer provide it
Then things are running happily and you could ask the disco server where to find `wordpress/http` to pull it up in your
web browser. If the database server crashes and recovers elsewhere, a few things will happen. First, when disco realizes
MySQL is no longer available (either by a clean disconnect, heartbeat timeout, or socket disconnect), it notifies
everyone who is subscribed that the endpoint has been dropped:
> **disco**: Looks like you were using `mysql`, but I'm sorry to tell you it's no longer available
> **container**: Thanks for letting me know
The container manager then attaches to the container to run an update command letting it know about the change. The
command can take care of updating the runtime configuration and restarting the application server.
Eventually the new MySQL server will come back online and register itself. Once registered, disco realizes that
WordPress is subscribed, so it lets it know:
> **disco**: Great news, I have a new `mysql` endpoint for you at `192.0.2.14:39414`
> **container**: Excellent, thanks
And it again runs the live update command, updating the environment and restarting the application server.
The disco protocol has a few more features (like using a single server for more than one WordPress/MySQL setup, or
filtering services by arbitrary tags like availability zones to improve load balancing), but that's the general idea.
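As a purely hypothetical illustration of the container side of that exchange, a WordPress-style configuration could read whatever variables the container manager injects; the `SCS_MYSQL_*` names below are invented for the example rather than being the actual variable names `scs-utils` uses:
{% highlight php %}
<?php

// wp-config.php style snippet reading endpoint details injected by the
// container manager; the fallbacks keep it usable in local development.
define('DB_HOST', sprintf(
    '%s:%s',
    getenv('SCS_MYSQL_HOST') ?: '127.0.0.1',
    getenv('SCS_MYSQL_PORT') ?: '3306'
));
define('DB_NAME', getenv('SCS_MYSQL_NAME') ?: 'wordpress');
define('DB_USER', getenv('SCS_MYSQL_USER') ?: 'wordpress');
define('DB_PASSWORD', getenv('SCS_MYSQL_PASSWORD') ?: '');
{% endhighlight %}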
## Configuration Files
I'm using YAML files to describe images and containers. They get compiled to a static version, and then cached based on
the image configuration. For example, take a look at this example [scs-wordpress][16] image manifest. It describes the
various connection points, docker details, and how it's configured. Now, take a look at the [Puppet manifests][17] which
enumerates all the configuration options which affect how the service will run. Finally, take a look at the
[sample config][18] which ties together what kind of image it needs to be able to run (configuration) and how that image
will be connected to the world.
## Self-Provisioning
For each of the four dependency/connection types (volumes, service provider, service dependent, network), I'm trying to
make them suitable for local development and AWS EC2 deployment. For example:
* AWS EC2 volumes can be auto-created, mounted, and attached to hosts for use by docker containers. This allows
services to drift across instances
* Likewise, I can also just use a local path for a volume and avoid an official network mount
* Various other strategies can be added for each dependency:
* nfs-volume: to attach a docker mount point to an external NFS mount
* aws-ec2-eni: to attach an ENI as the network interface for a docker container
My goal is to provide a manifest configuration file to a machine and know that it will load up whatever it needs to run,
including recompiling the image from scratch if it's not available in any caches.
## The Prototype
So, all those ideas are currently under development in my [`scs-utils`][20] repository. I've created a repository called
[`scs-example-blog`][19] which is a functional implementation of my original MVP. It provides a `Vagrantfile` for you to
easily try it out yourself and it goes through the process of getting the containers running on a single virtual machine,
accessing the services from the host, and then splitting them up across multiple virtual machines. It's more of a tutorial
describing the steps - typically the service deployment would be managed by Puppet.
## The Conclusion
All these ideas are absolutely a work in progress and I'm still actively tweaking the implementation, but it's now in a
functional enough state to briefly discuss the idea. So far it has been an excellent learning opportunity for Docker, custom
network protocols, and splitting some of the services I've previously been running into more reusable components. Even
if `scs-utils` isn't still what I'm using in 2 years, the refactoring it has motivated makes it significantly easier to
port into whatever more valuable tool surfaces further down the road.
[1]: https://www.docker.io/
[2]: http://linuxcontainers.org/
[3]: http://www.vagrantup.com/
[4]: https://www.virtualbox.org/
[5]: http://www.vmware.com/products/fusion
[6]: http://aws.amazon.com/ec2/
[7]: http://decking.io/
[8]: http://orchardup.github.io/fig/
[9]: http://www.serfdom.io/
[10]: http://puppetlabs.com/mcollective
[11]: https://flynn.io/
[12]: http://docs.cloudfoundry.org/bosh/
[13]: http://en.wikipedia.org/wiki/Minimum_viable_product
[14]: http://puppetlabs.com/puppet/puppet-open-source
[15]: https://github.com/coreos/etcd
[16]: https://github.com/dpb587/scs-wordpress/blob/3ba391d4f82da5c9642d88962e0bce32eb692add/scs/image.yaml
[17]: https://github.com/dpb587/scs-wordpress/tree/3ba391d4f82da5c9642d88962e0bce32eb692add/scs/puppet/scs/manifests
[18]: https://github.com/dpb587/scs-example-blog/blob/master/wordpress/manifest.yaml
[19]: https://github.com/dpb587/scs-example-blog
[20]: https://github.com/dpb587/scs-utils
[21]: http://ddollar.github.io/foreman/
[22]: https://github.com/kelseyhightower/confd
[23]: http://en.wikipedia.org/wiki/A_record#A
[24]: http://en.wikipedia.org/wiki/SRV_record

View File

@@ -1,180 +0,0 @@
---
title: Photo Galleries for Jekyll
layout: post
tags: [ 'blog', 'gallery', 'iphoto', 'jekyll', 'jekyllrb', 'photo', 'ruby' ]
description: Easily exporting my iPhoto album to this Jekyll-based site.
---
I had a trip to London and Iceland several weeks ago, and I wanted to share some of those photos with people. In the
past I've put those sorts of photo galleries on Facebook, but some friends don't have accounts there and I figured I
could/should just keep my photos with my other personal stuff here.
Unlike [WordPress][1], [Jekyll][2] doesn't really have a concept of photo galleries, and since Jekyll is a static site
generator it makes things a little more difficult. I looked through [several][3] [other][4] [posts][5] discussing Jekyll
photo galleries, but they all seemed a bit more primitive than what I wanted. I wanted to:
* stick with existing Jekyll paradigms (e.g. [markdown][8] file to static page),
* retain metadata about my photos (e.g. location data, camera EXIF data),
* support multiple views about my galleries (e.g. photo list, map, slideshow),
* ensure photos can have landing pages and be easily navigated, and
* avoid committing images to my git repository.
After giving it some thought, I realized this was going to be a multi-step process.
0. Script the process of exporting my existing photos to Jekyll-friendly structures.
0. Find a Jekyll/[Liquid][7] plugin to enumerate directories/files and use the results.
0. Create templates and pages for my gallery and its photos.
0. Publish the site!
## Step 1: Export existing photo galleries (iPhoto)
I take pretty much all my photos with my phone and those photos then get synced up with iPhoto. At the end of my trip, I
browse through the photos and create an album of interesting ones. Normally I don't go through and give every photo a
title and description, but if I'm planning on sharing them I add brief notes within iPhoto.
I knew my iPhoto metadata was stored in `AlbumData.xml`, but I've always had poor performance with massive XML data
files. I decided to start with a different approach: [AppleScript][9]. The following snippet gets me the file paths of
all the photos (in order) from whatever album I ask for:
{% highlight applescript %}{% raw %}
on run argv
set output to ""
tell application "iPhoto"
set vAlbum to first item of (get every album whose name is (item 1 of argv))
set vPhotos to get every photo in vAlbum
repeat with vPhoto in vPhotos
set output to output & original path of vPhoto & "
"
end repeat
end tell
return output
end run
{% endraw %}{% endhighlight %}
So, to get the photos in my album named "London-Iceland Trip" I can do:
{% highlight console %}{% raw %}
$ osascript export-iphoto-album.applescript 'London-Iceland Trip'
~/Pictures/iPhoto Library.photolibrary/Masters/2014/03/13/20140313-154842/IMG_0303.JPG
~/Pictures/iPhoto Library.photolibrary/Masters/2014/03/13/20140313-154842/IMG_0308.JPG
...snip...
{% endraw %}{% endhighlight %}
With some tweaks I can get more than just the path to a photo:
{% highlight console %}{% raw %}
$ osascript export-iphoto-album.applescript 'London-Iceland Trip'
altitude: 16
latitude: 51.50038
longitude: -0.12786667
name: A Classic View
date: Thursday, March 6, 2014 at 4:44:12 PM
path: ~/Pictures/iPhoto Library.photolibrary/Masters/2014/03/13/20140313-154842/IMG_0303.JPG
title: A Classic View
------
QCon was held at The Queen Elizabeth II Conference Centre and this was the view out one of the common areas.
------------
...snip...
{% endraw %}{% endhighlight %}
The next piece is to write something which will clean up the output, resize the photos, and write out all the different
Jekyll files. For that I created a [PHP][10] script since it was going to be easiest for me. Once complete, I then just
pipe the export results to the script and specify the image sizes I want:
{% highlight console %}{% raw %}
$ osascript ../jekyll-gallery/export-iphoto.applescript 'London-Iceland Trip' | \
php ../jekyll-gallery/convert.php 2014-london-iceland-trip \
--export 96x96 --export 200x200 --export 640 --export 1280
df5150c-a-classic-view...96x96...200x200...640...1280...mdown...done
7cf02b5-night...96x96...200x200...640...1280...mdown...done
...snip...
{% endraw %}{% endhighlight %}
Once complete, all the resized images are in `asset/gallery/2014-london-iceland-trip`, and my markdown files with the
photo details are in `gallery/2014-london-iceland-trip`, easily [readable][15].
## Step 2: Jekyll plugin
At a minimum, I wanted to have a listing of all the photos on a gallery index page. After some searches, I found
[two][11] [scripts][12] which became the inspiration for my own. My [final plugin][16] looks like:
Tag:
loopdir
Attributes:
match: a pattern to match files within the path (e.g. "*.md")
parse: whether to load the file and parse for YAML front matter
path: a directory, relative to the site root, to find files
sort: a property to sort by (e.g. "path")
Result:
An "item" object is exposed to the template with a "page"-like structure.
If parsing is enabled, the YAML properties are available as "item.title".
Which means I can easily compose a simple photo list with:
{% highlight jinja %}{% raw %}
{% loopdir path:"gallery/2014-london-iceland-trip" match:"*.md" sort:"ordering" %}
<a href="/{{ item.fullname }}.html">
<img alt="Photo: {{ item.title }}" height="200" src="/{{ item.fullname }}~200x200.jpg" title="{{ item.title }}" width="200" />
</a>
{% endloopdir %}
{% endraw %}{% endhighlight %}
I reuse this plugin elsewhere for regular directory listings.
## Step 3: Create templates
I've started out with two reusable templates in my `_includes` directory:
0. [Gallery List][13] - a simple listing of thumbnails from all the photos in the gallery
0. [Interactive Map][14] - an interactive map showing where all the photos were taken
I can pass arguments (like the gallery name) to the include which makes it easy to embed a gallery in any page:
{% highlight jinja %}{% raw %}
{% include gallery_list.html gallery='2014-london-iceland-trip' %}
{% endraw %}{% endhighlight %}
## Step 4: Publish
After generating everything locally, I just have to do a couple steps:
0. Commit all the new `gallery/2014-london-iceland-trip` files (and new templates)
0. Run `_build/aws/publish-asset.sh $AWS_S3CMD_CONFIG gallery/2014-london-iceland-trip` to upload all the exported JPGs
0. Run `_build/aws/build.sh _build/aws/publish.sh $AWS_S3CMD_CONFIG` to upload any modifications from the rest of the
site
To make things easier for myself and, possibly, others I put the conversion scripts in my [jekyll-gallery][17] repo.
Now I'm able to refer people to the [gallery](/gallery/2014-london-iceland-trip/) or embed the gallery somewhere
useful...
<div style="line-height:0;padding:4px 0 0 1px;">
{% loopdir path:"gallery/2014-london-iceland-trip" match:"*.md" sort:"ordering" %}<a href="/{{ item.fullname }}.html" style="display:inline-block;margin:3px;text-decoration:none;"><img alt="Photo: {{ item.title }}" height="48" src="{{ site.asset_prefix }}/{{ item.fullname }}~96x96.jpg" title="{{ item.title }}" width="48" style="padding:1px;" /></a>{% endloopdir %}
</div>
[1]: http://wordpress.org/
[2]: http://jekyllrb.com/
[3]: https://github.com/ggreer/jekyll-gallery-generator
[4]: http://www.mgratzer.com/from-wordpress-to-jekyll/
[5]: https://github.com/tsmango/jekyll_flickr_set_tag
[6]: https://help.github.com/articles/what-are-github-pages
[7]: http://liquidmarkup.org/
[8]: http://daringfireball.net/projects/markdown/
[9]: https://developer.apple.com/library/mac/documentation/applescript/Conceptual/AppleScriptX/AppleScriptX.html
[10]: http://www.php.net/
[11]: https://gist.github.com/jgatjens/8925165
[12]: http://simon.heimlicher.com/articles/2012/02/01/jekyll-directory-listing
[13]: https://github.com/dpb587/dpb587.me/blob/master/_includes/gallery_list.html
[14]: https://github.com/dpb587/dpb587.me/blob/master/_includes/gallery_map.html
[15]: https://github.com/dpb587/dpb587.me/blob/master/gallery/2014-london-iceland-trip/df5150c-a-classic-view.md
[16]: https://github.com/dpb587/dpb587.me/blob/master/_plugins/loopdir.rb
[17]: https://github.com/dpb587/jekyll-gallery

View File

@@ -1,228 +0,0 @@
---
title: Search by Color with Elasticsearch
layout: post
tags: [ 'color', 'ecommerce', 'elasticsearch', 'hsv', 'search', 'weighted' ]
description: Some mappings, strategies, and queries for advanced color searching with elasticsearch.
primary_image: /blog/2014-04-24-color-searching-with-elasticsearch/search0.png
---
A [year ago][1] when I updated the [TLE website][2] I dropped the "search by color" functionality. Originally, all the
colors were indexed into a database table and the frontend generated some complex queries to support specific and
multi-color searching. On occasion, it caused some database bottlenecks during peak loads and with some particularly
complex color combinations. The color search was also a completely separate interface from searching other product
attributes and availability. It was neat, but it was not a great user experience.
It took some time to get back to the search by color functionality, but I've finally been able to get back to it and,
with [elasticsearch][3], significantly improve it.
<a href="http://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/"><img alt="Screenshot: colorized yarn" height="162" src="{{ site.asset_prefix }}/blog/2014-04-24-color-searching-with-elasticsearch/search0.png" width="628" /></a>
## Color Quantification
One of the most difficult parts of supporting color search is figuring out the colors in products. In our case, with
thousands of items to "colorize", it would be easier to create an algorithm than to have somebody manually pick out
significant colors. When it comes to algorithms and research, the process is called [color quantization][8].
A lot of the inventory at the shop is yarn and, unfortunately, the tools I tried didn't do a good job at picking out the
fiber colors (they would find significance in the numerous shadows or average colors).
Ultimately I ended up creating my own algorithm based on several strategies. In addition to finding the significant
colors it also keeps track of their ratios making it easy to realize multi-color items vs items with accent colors.
After batch processing inventory to bring colors up to date, I added hooks to ensure new images are processed for colors
as they're uploaded.
<a href="https://www.theloopyewe.com/shop/p/78C97118-Gobelin-A-moi-le-coco"><img alt="Screenshot: colorized yarn" height="129" src="{{ site.asset_prefix }}/blog/2014-04-24-color-searching-with-elasticsearch/colorizer-yarn.png" width="628" /></a>
<a href="https://www.theloopyewe.com/shop/p/86330BB1-DS23-Seafaring"><img alt="Screenshot: colorized fabric" height="129" src="{{ site.asset_prefix }}/blog/2014-04-24-color-searching-with-elasticsearch/colorizer-fabric.png" width="628" /></a>
You can see it noticed the significant colors of the yarn and fabric above, along with their approximate ratios. With
some types of items, it may be possible to infer additional meaning such as the "background color" of fabric.
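To make the idea of significant colors and their ratios concrete, here's a drastically simplified sketch: bucket pixels
by coarse hue/saturation/value (HSV is covered in the next section) and rank the buckets by their share of the image.
The actual colorizer uses several additional strategies, so treat this purely as an illustration.
{% highlight javascript %}
// Illustrative only: rank coarse HSV buckets by their share of the image.
// Assumes pixels have already been converted to HSV (h: 0-360, s/v: 0-100).
function significantColors(hsvPixels, topN) {
  var buckets = {};
  hsvPixels.forEach(function (p) {
    // coarse bucket: 30-degree hue slices, 25-point saturation/value slices
    var key = [
      Math.floor(p.h / 30) * 30,
      Math.floor(p.s / 25) * 25,
      Math.floor(p.v / 25) * 25
    ].join(':');
    buckets[key] = (buckets[key] || 0) + 1;
  });

  return Object.keys(buckets)
    .map(function (key) {
      var hsv = key.split(':').map(Number);
      return {
        h : hsv[0], s : hsv[1], v : hsv[2],
        ratio : Math.round(100 * buckets[key] / hsvPixels.length)
      };
    })
    .sort(function (a, b) { return b.ratio - a.ratio; })
    .slice(0, topN);
}
{% endhighlight %}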
## Color Theory
When it comes to color, there are a few standard methods for measuring it. Probably the most familiar one from a web
perspective is [RGB][6]. Unfortunately, RGB doesn't efficiently quantify the "color" or hue. For example,
<span style="background-color:#F42805;border-radius:4px;padding:0 3px;">244, 40, 5</span> and
<span style="background-color:#F4D6D6;border-radius:4px;padding:0 3px;">244, 214, 214</span>
are both obviously reddish, but the second has high green and blue values even though neither green nor blue is visibly
present.
A much better model for this is [HSV][7] (or HSL). The "color" (hue, `H`) cycles from 0 thru 360 where 0 and 360 are
both red. The `S` for "saturation" ranges from 0 to 100 and describes how much "color" there is. Finally, the `V` for
"value" (or `B` for "brightness") ranges from 0 to 100 and describes how bright or dark it is. Compare the following
examples for a better idea:
* <span style="background-color:#B23535;border-radius:4px;padding:0 3px;">0, 70, 70</span>
* <span style="background-color:#B27C7C;border-radius:4px;padding:0 3px;">0, 30, 70</span>
* <span style="background-color:#4C1616;border-radius:4px;padding:0 3px;">0, 70, 30</span>
* <span style="background-color:#4C3535;border-radius:4px;padding:0 3px;">0, 30, 30</span>
* <span style="background-color:#35B2B2;border-radius:4px;padding:0 3px;">180, 70, 70</span>
* <span style="background-color:#7CB2B2;border-radius:4px;padding:0 3px;">180, 30, 70</span>
* <span style="background-color:#164C4C;border-radius:4px;padding:0 3px;">180, 70, 30</span>
* <span style="background-color:#354C4C;border-radius:4px;padding:0 3px;">180, 30, 30</span>
Within elasticsearch we can easily map an object with the three color properties as integers:
{% highlight javascript %}
{ "color" : {
"properties" : {
"h" : {
"type" : "integer" },
"s" : {
"type" : "integer" },
"v" : {
"type" : "integer" } } } }
{% endhighlight %}
## Mappings
Elasticsearch will natively handle [arrays][5] of multiple colors, but `color` needs to become a [`nested`][4] mapping
type in order to support realistic searches. For example, we could write a query looking for a dark blue, but unless
it's a nested object the query could match items which have any sort of blue (`color.h = 240`) and any sort of dark
(`color.v < 50`). To make `color` nested, we just have to add `type = nested`. Then we're able to write `nested` filters
which will look like:
{% highlight javascript %}
{ "nested" : {
"path" : "color",
"filter" : {
"bool" : {
"must" : [
{
"term" : {
"color.h" : 240 } },
{
"range" : {
"color.v" : {
"lt" : 50 } } } ] } } } }
{% endhighlight %}
With the extra color proportion value mentioned earlier, we're also able to add a `ratio` range alongside `h`, `s`, and
`v`. This will allow us to find items where blue is more of a dominant color (e.g. more than 80%). Another searchable
fact which may be useful is `color_count` - then we would be able to find all solid-color products, or all dual-color
products, or just any products with more than four significant colors.
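Putting those pieces together, the relevant slice of the mapping might look something like this (a sketch, not copied
from the live index):
{% highlight javascript %}
{ "properties" : {
    "color_count" : {
      "type" : "integer" },
    "color" : {
      "type" : "nested",
      "properties" : {
        "h" : {
          "type" : "integer" },
        "s" : {
          "type" : "integer" },
        "v" : {
          "type" : "integer" },
        "ratio" : {
          "type" : "integer" } } } } }
{% endhighlight %}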
While working on a frontend interface, I was having trouble faceting popular colors. A lot of dull colors were coming
back. As a first step, I started using some [`terms`][9] aggregations with a `value_script` which created large buckets
of colors from the `h`, `s`, and `v` tuple. That helped significantly, but then it seemed like there was a
disproportionate number of very dark and very light colors. Instead of adding additional calculations to the aggregation
during runtime, I decided to pre-compute the buckets that the colors should belong to. Now the more advanced
calculations happen once at index time, with none at query time. For example, all low-`v` colors will end up in a
single bucket `{ h : 360 , s : 10 , v : 10 , ... }`. Similar rules trim low-saturation colors into their own
appropriate buckets.
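With the pre-computed bucket stored on each indexed color (here as a hypothetical `color.bucket` string such as
`"360-10-10"`, so it can serve as a terms key), the facet becomes a plain [`terms`][9] aggregation inside a `nested`
aggregation:
{% highlight javascript %}
{ "aggs" : {
    "colors" : {
      "nested" : {
        "path" : "color" },
      "aggs" : {
        "popular" : {
          "terms" : {
            "field" : "color.bucket",
            "size" : 24 } } } } } }
{% endhighlight %}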
## Searches
Given four key properties (hue, saturation, value, and color ratio), I needed a way to represent the searches from
users. For searching individual colors, I settled on the following syntax:
{ratio-min}-{ratio-max}~{hue}-{sat}-{val}~{hue-range}-{sat-range}-{val-range}
This way, if a user is very specific about the dark blue they want, and they want at least 80% of the item to be blue,
the color slug might look like: [`80-100~190-100-50~10-5-5`][10]. Within the application, this gets translated into a
[`filtered`][11] query. The filter part looks like:
{% highlight javascript %}
{ "filter": {
"and": [
{ "nested": {
"path": "color",
"filter": {
"and": [
{ "range": {
"ratio": {
"gte": 80,
"lte": 100 } } },
{ "range": {
"h": {
"gte": 180,
"lte": 200 } } },
{ "range": {
"s": {
"gte": 95,
"lte": 100 } } },
{ "range": {
"v": {
"gte": 45,
"lte": 55 } } } ] } } } ] } }
{% endhighlight %}
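For illustration, translating a slug into those range filters is mostly string splitting; a minimal sketch
(hypothetical code, ignoring hue wrap-around at 0/360):
{% highlight javascript %}
// "{ratio-min}-{ratio-max}~{hue}-{sat}-{val}~{hue-range}-{sat-range}-{val-range}"
function parseColorSlug(slug) {
  var parts = slug.split('~').map(function (p) { return p.split('-').map(Number); });
  var ratio = parts[0], hsv = parts[1], fuzz = parts[2];
  var clamp = function (n) { return Math.max(0, Math.min(100, n)); };

  return {
    ratio : { gte : ratio[0], lte : ratio[1] },
    h : { gte : hsv[0] - fuzz[0], lte : hsv[0] + fuzz[0] },
    s : { gte : clamp(hsv[1] - fuzz[1]), lte : clamp(hsv[1] + fuzz[1]) },
    v : { gte : clamp(hsv[2] - fuzz[2]), lte : clamp(hsv[2] + fuzz[2]) }
  };
}

parseColorSlug('80-100~190-100-50~10-5-5');
// -> ratio 80-100, h 180-200, s 95-100, v 45-55 (matching the filter above)
{% endhighlight %}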
The query part then becomes responsible for ranking using a basic calculation which roughly computes the distance
between the requested color and the matched color. The [`function_score`][13] query currently looks like:
{% highlight javascript %}
{ "function_score": {
"boost_mode": "replace",
"query": {
"nested": {
"path": "color",
"query": {
"function_score": {
"score_mode": "multiply",
"functions": [
{ "exp": {
"h": {
"origin": 190,
"offset": 2,
"scale": 4 } } },
{ "exp": {
"s": {
"origin": 100,
"offset": 4,
"scale": 8 } } },
{ "exp": {
"v": {
"origin": 50,
"offset": 4,
"scale": 8 } } },
{ "linear": {
"ratio": {
"origin": 100,
"offset": 5,
"scale": 10 } } } ] } },
"score_mode": "sum" } },
"functions": [
{ "script_score": {
"script": "_score" } } ] } }
{% endhighlight %}
The `_score` can then be used in sorting to show the closest color matches first.
<a href="http://www.theloopyewe.com/shop/search/cd/80-100~190-100-50~10-5-5/g/59A9BAC5/"><img alt="Screenshot: search screen shot" height="400" src="{{ site.asset_prefix }}/blog/2014-04-24-color-searching-with-elasticsearch/search1.png" width="628" /></a>
Of course, these color searches can be added alongside the other facet searches like product availability, attributes,
and regular keyword searches.
## User Interface
One of the more difficult tasks of the color search was to create a reasonable user interface to front the powerful
capabilities. This initial version uses the same interface as a year ago, letting users pick from the available "color
dots". Ultimately I hope to improve it with a more advanced, yet simple, [Raphaël][12] interface which would let them
pick a specific color and say how picky they want to be. That goal requires a fair bit of time and learning though...
## Summary
I'm excited to have the search by color functionality back. I'm even more excited about the possibilities of better,
advanced user searches further down the road. After it gets used a bit more, I hope we can more prominently promote the
color search functionality around the site. Elasticsearch has been an excellent tool for our product searching and it's
exciting to continue expanding the role it takes in powering the website.
[1]: /blog/2013/04/27/new-website-for-the-loopy-ewe.html
[2]: http://www.theloopyewe.com/
[3]: http://www.elasticsearch.org/
[4]: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/mapping-nested-type.html
[5]: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/mapping-array-type.html
[6]: http://en.wikipedia.org/wiki/RGB_color_model
[7]: http://en.wikipedia.org/wiki/HSL_and_HSV
[8]: http://en.wikipedia.org/wiki/Color_quantization
[9]: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html
[10]: http://www.theloopyewe.com/shop/search/cd/80-100~190-100-50~10-5-5/g/59A9BAC5/
[11]: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/query-dsl-filtered-query.html#query-dsl-filtered-query
[12]: http://raphaeljs.com/
[13]: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/query-dsl-function-score-query.html

View File

@@ -1,737 +0,0 @@
---
title: "Simplifying My BOSH-related Workflows"
layout: "post"
tags: [ "aws", "bosh", "cloudformation", "cloudfoundry", "cloque", "docker", "ec2", "packaging", "snapshots", "twig" ]
description: "Discussing some commands and wrappers I've been adding on top of BOSH."
---
Over the last nine months I've been getting into [BOSH][1] quite a bit. Historically, I've been [reluctant][20] to
invest in BOSH because I don't entirely agree with its architecture and steep learning curve. BOSH
[describes itself][1] with...
> BOSH installs and updates software packages on large numbers of VMs over many IaaS providers with the absolute
> minimum of configuration changes.
>
> BOSH orchestrates initial deployments and ongoing updates that are:
>
> * Predictable, repeatable, and reliable
> * Self-healing
> * Infrastructure-agnostic
With continued use and experience necessitated from the [logsearch][2] project, I saw ways it would solve more critical
problems for me than it would create. For that reason, I started experimenting and migrating some services over to
BOSH to better evaluate it for my own uses. To help bridge the gap between BOSH inconveniences and some of my
architectural/practical differences I've been making a tool called [`cloque`][3].
You might find the ideas more useful than the `cloque` code itself - it is, after all, experimental and written
in PHP (since that's what I'm most productive in) whereas `bosh` is more Ruby/Go-oriented.
## Infrastructure First
Generally speaking, BOSH needs some help with infrastructure (i.e. it can't create its own VPC, network routing tables,
etc). Additionally, sometimes deployments don't even need the BOSH overhead. Within `cloque`, I've split management
tasks into two components:
* Infrastructure - this is more of the "physical" layer defining the networking layer, some independent services (e.g.
NAT gateways, VPN servers), security groups, and other core or non-BOSH functionality.
* BOSH - everything related to BOSH (e.g. director, deployment, snapshots, releases, stemcells) which is deployed onto
the infrastructure somewhere.
Since BOSH depends on some infrastructure, we'll get started with that first. One key to a `cloque`-managed environment
is that each environment has its own directory which includes a `network.yml` in the top-level. The network may be
located in a single datacenter, or it could span multiple countries. The file defines all the basics about the network
including subnets, reserved IPs, basic cloud properties, and some logical names.
I've committed an example network to the [`share`][7] directory within `cloque` and will use that in the examples here.
To get started, we'll copy the example and work with it...
# copy the sample environment
$ cp -r ~/cloque/share/example-multi ~/cloque-acme-dev
$ cd ~/cloque-acme-dev
# this will help the command know where to look for configs later
$ export CLOQUE_BASEDIR="$PWD"
If you take a look at the sample [`network.yml`][18], you'll see a couple regions with their individual network
segments, VPN networks, and a few reserved IP addresses which can be referenced elsewhere. Once `network.yml` is
created, the `utility:initialize-network` task can take care of bootstrapping the following:
* create stub folders for your different regions (e.g. `aws-apne1/core`, `global/private`)
* create a new SSH key (in `global/private/cloque-{yyyymmdd}*.pem`) and upload it to the AWS regions being used
* create a new IAM user, access key, and EC2 policy for BOSH to use
* create a certificate authority for [OpenVPN][8] usage
* create both client/server certificates for the inter-region VPN connections (requires interactive prompts for
passwords/confirmations)
* create an S3 bucket for shared configuration storage
When run, it assumes AWS credentials can be discovered from the environment...
$ cloque utility:initialize-network
> local:fs/global -> created
...snip...
> I created `utility:initialize-network` because I found myself reusing keys and buckets across multiple environments
> (such as development vs production) since they were annoying to manage by hand. I wanted to make security easier
> for myself and, in the process, simplify the process through automation.
The top-level `global` directory is intended for configuration which applies to all areas. With the example I use it to
create an additional IAM role which allows VPN gateways to securely download their VPN keys and configuration files...
$ ( cd global/core && cloque infra:put --aws-cloudformation 'Capabilities=["CAPABILITY_IAM"]' )
> validating...done
> checking...missing
> deploying...done
> waiting...CREATE_IN_PROGRESS...........................CREATE_COMPLETE...done
The `infra:put` is the core command responsible for managing the low-level, infrastructure-related resources. The
command looks for an `infrastructure.json` file (see the [example][27]) and since I'm focused on [AWS][4], the files
are [CloudFormation][5] scripts.
> One thing I dislike about BOSH is how it uses a state file or global options to specify the director/deployment. It
> makes it very inconvenient to quickly switch between directors/deployments even between multiple terminal sessions.
> To help with that, `cloque` respects environment variables (or command line options) to know where it should be
> working from. The `CLOQUE_BASEDIR` (exported earlier) is the most significant, and it was able to detect when it was
> working from the `global` region/director and `core` deployment based on the current directory.
Now that the global resources have been created, we can create our "core" resources for the `us-west-2` region. If you
take a look at the [infrastructure.json][28] file, you'll see it creates a VPC, multiple subnets for each availability
zone, a couple base security groups, and a gateway instance which will function as a VPN server to allow inter-region
communication. You'll also notice it's using [Twig][10] templating to load `network.yml` and simplify what would be a
lot of repeated resources. We'll use the `infra:put` command again, but this time within the `aws-usw2/core`
directory...
$ cd aws-usw2
$ ( cd core && cloque infra:put )
...snip...
> waiting...CREATE_IN_PROGRESS.........................CREATE_COMPLETE...done
> BOSH supports ERB-templated deploy manifests. With ERB I found myself repeating a lot of code in each manifest when
> trying to make it dynamic. After trying [spiff][21] (which I found a bit limited and difficult to understand), I
> decided to use a different approach - one that would allow for the same dynamic, peer-config referencing, and
> (later) transformational capabilities for both infrastructure configuration and BOSH deployment manifests.
Once the `infra:put` command finishes, the `aws-usw2` part of the environment is complete which means the OpenVPN
server is ready for a client. First we'll need to create and sign a client certificate though...
# temporary directory
$ mkdir tmp-myovpn
$ cd tmp-myovpn
# create a key (named after the hostname and current date)
$ TMPOVPN_CN=$(hostname -s)-$(date +%Y%m%da)
$ openssl req \
-subj "/C=US/ST=CO/L=Denver/O=ACME Inc/OU=client/CN=${TMPOVPN_CN}/emailAddress=`git config user.email`" \
-days 3650 -nodes \
-new -out openvpn.csr \
-newkey rsa:2048 -keyout openvpn.key
Generating a 2048 bit RSA private key
.............................+++
................+++
writing new private key to 'openvpn.key'
-----
# sign the certificate (you'll need to enter the PKI password you used in the first step)
$ cloque openvpn:sign-certificate openvpn.csr
# now create the OpenVPN configuration profile for connecting to aws-usw2
$ ( \
cloque openvpn:generate-profile aws-usw2 $TMPOVPN_CN \
; echo '<key>' \
; cat openvpn.key \
; echo '</key>' \
) > acme-dev-aws-usw2.ovpn
# opening should install it with a GUI connection manager like Tunnelblick
$ open acme-dev-aws-usw2.ovpn
# cleanup
$ cd ../
$ rm -fr tmp-myovpn
$ unset TMPOVPN_CN
> I created the `openvpn:sign-certificate` and, especially, `openvpn:generate-profile` commands to make the steps highly
> reproducible and to encourage better certificate usage practices through their "trivialness".
Since I'm using `example.com` in the `share` scripts as the domain, DNS won't resolve it. For now, the easiest solution
is to manually add an entry to `/etc/hosts`...
$ echo "`cd core && cloque infra:get '.Z0GatewayEipId'` gateway.aws-usw2.acme-dev.cloque.example.com" \
| sudo tee -a /etc/hosts
> The `infra:get` command allows me to programmatically fetch configuration details about the current deployment. For
> infrastructure, this allows me to extract the created resource IDs/names using [jq][12] statements. This makes it
> extremely easy to automate basic lookup tasks (as in this case), but also allows for more complex IP or security
> group enumeration which can be used for other composable, automated tasks.
Once `/etc/hosts` is updated, I can connect with an OpenVPN client like [Tunnelblick][13] and ping the network...
$ ping -c 5 10.101.0.4
PING 10.101.0.4 (10.101.0.4): 56 data bytes
64 bytes from 10.101.0.4: icmp_seq=0 ttl=64 time=59.035 ms
64 bytes from 10.101.0.4: icmp_seq=1 ttl=64 time=61.288 ms
64 bytes from 10.101.0.4: icmp_seq=2 ttl=64 time=78.194 ms
64 bytes from 10.101.0.4: icmp_seq=3 ttl=64 time=57.850 ms
64 bytes from 10.101.0.4: icmp_seq=4 ttl=64 time=57.956 ms
--- 10.101.0.4 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 57.850/62.865/78.194/7.764 ms
## BOSH Director
Now that we have a VPC and a private network to deploy things into, we can start a BOSH Director. Here it's important
to note that I'm using "region", "network segment", and "director" interchangeably. Typically you'll have a single BOSH
Director within an environment's region, and since that Director will tag its deployment resources with a "director"
tag, I decided to make them all synonyms. The effect is twofold:
* when you see a "director" name (whether it's in the context of BOSH or not) it refers to where resources are
provisioned
* you can consistently use a "director" tag (BOSH or not) to identify where something is deployed which makes AWS
resource management much simpler (and AWS Billing reports by tag much more valuable).
Back to getting BOSH deployed though. First, we'll create some additional BOSH-specific, region-specific infrastructure
(specifically, security groups for the director and agents)...
$ ( cd bosh && cloque infra:put )
...snip...
> waiting...CREATE_IN_PROGRESS...............CREATE_COMPLETE...done
> Here I start using the `bosh` directory. I put Director-related configuration in the `bosh` deployment. Individual
> BOSH deployments get their own directory.
Once the security groups are available, we can create the BOSH Director. The `boshdirector:*` commands deal with the
Director tasks (i.e. they don't depend on a specific deployment). To get started, the `boshdirector:inception:start`
command takes care of provisioning the inception instance (it takes a few minutes to get everything installed and
configured)...
$ cloque boshdirector:inception:start \
--security-group $( cloque --deployment=core infra:get '.TrustedPeerSecurityGroupId' ) \
--security-group $( cloque --deployment=core infra:get '.PublicGlobalEgressSecurityGroupId' ) \
$( cloque --deployment=core infra:get '.SubnetZ0PublicId' ) \
t2.micro
> finding instance...missing
> instance-id -> i-f84169f3
> tagging director -> acme-dev-aws-usw2
> tagging deployment -> cloque/inception
> tagging Name -> main
> waiting for instance...pending.........running...done
> waiting for ssh.......done
> installing...
...snip...
> uploading compiled/self...
...snip...
> uploading global/private...
...snip...
> You'll notice the `cloque --deployment=core infra:get` usage to load the security groups. The `--deployment`
> option is an alternative to running `cd ../core` before the command. Another alternative would be to use the
> `CLOQUE_DEPLOYMENT` environment variable. Whatever the case, `cloque` is intelligent and flexible about figuring out
> where it should be working from.
Before continuing, there's still a manual process of finding the correct stemcell. If we were in `us-east-1`, we could
use the "light-bosh" stemcell (which is really just an alias to a pre-compiled AMI that Cloud Foundry publishes).
Unfortunately, we need to take the slower route of compiling our own AMI for `us-west-2`. To do this, we need to lookup
the latest stemcell URL from the [published artifacts][15], then we pass that URL to the next command...
$ cloque boshdirector:inception:provision \
https://s3.amazonaws.com/bosh-jenkins-artifacts/bosh-stemcell/aws/bosh-stemcell-2710-aws-xen-ubuntu-trusty-go_agent.tgz
> finding instance...found
> instance-id -> i-f84169f3
> deploying...
WARNING! Your target has been changed to `https://10.101.16.8:25555'!
Deployment set to '/home/ubuntu/cloque/self/bosh/bosh.yml'
Verifying stemcell...
File exists and readable OK
Verifying tarball...
Read tarball OK
Manifest exists OK
Stemcell image file OK
Stemcell properties OK
Stemcell info
-------------
Name: bosh-aws-xen-ubuntu-trusty-go_agent
Version: 2710
Started deploy micro bosh
Started deploy micro bosh > Unpacking stemcell. Done (00:00:18)
Started deploy micro bosh > Uploading stemcell. Done (00:05:16)
Started deploy micro bosh > Creating VM from ami-8fe7a1bf. Done (00:00:19)
Started deploy micro bosh > Waiting for the agent. Done (00:01:19)
Started deploy micro bosh > Updating persistent disk
Started deploy micro bosh > Create disk. Done (00:00:02)
Started deploy micro bosh > Mount disk. Done (00:00:09)
Done deploy micro bosh > Updating persistent disk (00:00:19)
Started deploy micro bosh > Stopping agent services. Done (00:00:01)
Started deploy micro bosh > Applying micro BOSH spec. Done (00:00:21)
Started deploy micro bosh > Starting agent services. Done (00:00:01)
Started deploy micro bosh > Waiting for the director. Done (00:00:19)
Done deploy micro bosh (00:08:13)
Deployed `bosh/bosh.yml' to `https://10.101.16.8:25555', took 00:08:13 to complete
> fetching bosh-deployments.yml...
receiving file list ...
1 file to consider
bosh-deployments.yml
1025 100% 1000.98kB/s 0:00:00 (xfer#1, to-check=0/1)
sent 38 bytes received 723 bytes 101.47 bytes/sec
total size is 1025 speedup is 1.35
> tagging...done
> The `:start` command took care of pushing the compiled manifest, but this `:provision` command is responsible for
> pushing everything to the director and, once complete, downloading the resulting configuration locally. I created
> these two commands because they were a common task and the manual, iterative process was getting tiresome. It also
> helps unify both the initial provisioning vs upgrade process *and* deploying from AMI vs TGZ. Instead of ~12 manual
> steps spread out over ~30 minutes, I only need to intervene at three points (including instance termination).
Once the provisioning step is complete, I can login and talk to BOSH...
# default username/password is admin/admin
$ bosh target https://10.101.16.8:25555
$ bosh status
Config
/Users/dpb587/cloque-acme-dev/aws-usw2/.bosh_config
Director
Name acme-dev-aws-usw2
URL https://10.101.16.8:25555
Version 1.2710.0 (00000000)
User admin
UUID f38d685c-9a72-4fc0-bc84-558979cc80bf
CPI aws
dns enabled (domain_name: microbosh)
compiled_package_cache disabled
snapshots disabled
Deployment
not set
Since BOSH Director is successfully running, it's safe to terminate the inception instance. Whenever there's a new BOSH
version I want to deploy, I can just rerun the two `start` and `provision` commands (with an updated stemcell URL)
and it will take care of upgrading it.
### More on Stemcells
While inception was deploying the BOSH Director, it ended up making a stemcell that I can reuse for our BOSH
deployments. Unfortunately, the Director doesn't know about it. The following command takes care of publishing it...
$ cloque boshutil:create-bosh-lite-stemcell-from-ami \
https://s3.amazonaws.com/bosh-jenkins-artifacts/bosh-stemcell/aws/light-bosh-stemcell-2710-aws-xen-ubuntu-trusty-go_agent.tgz \
ami-8fe7a1bf
Uploaded Stemcell: https://example-cloque-acme-dev.s3.amazonaws.com/bosh-stemcell/aws/us-west-2/light-bosh-stemcell-2710-aws-xen-ubuntu-trusty-go_agent.tgz
> The command uses the URL (the light-bosh stemcell of the same version from the [artifacts][15] page) as a template
> and patches in the correct metadata for the local region. It then takes care of uploading it to the environment's S3
> bucket and to the Director so it's immediately usable.
Another task I frequently need to do is convert the standard stemcells (which only support PV virtualization) into
HVM stemcells that I can use with AWS's newer instance types. This next command takes care of all those steps
and, once complete, there will be a new `*-hvm` stemcell ready for use on the Director.
$ cloque boshutil:convert-pv-stemcell-to-hvm \
https://example-cloque-acme-dev.s3.amazonaws.com/bosh-stemcell/aws/us-west-2/light-bosh-stemcell-2710-aws-xen-ubuntu-trusty-go_agent.tgz \
ami-d13845e1 \
$( cloque --deployment=core infra:get '.SubnetZ0PrivateId , .TrustedPeerSecurityGroupId' )
Created AMI: ami-f3e3a5c3
Uploaded Stemcell: https://example-cloque-acme-dev.s3.amazonaws.com/bosh-stemcell/aws/us-west-2/light-bosh-stemcell-2710-aws-xen-ubuntu-trusty-go_agent-hvm.tgz
> The command needs the light-bosh TGZ and AMI for the existing PV stemcell as well as a subnet and security group for
> it to provision the conversion instances in.
## BOSH Deployment
Now that the BOSH Director is running, I can deploy something interesting onto it. Let's use [logsearch][2] as an
example. First I'll need to clone the repository...
$ git clone https://github.com/logsearch/logsearch-boshrelease.git ~/logsearch-boshrelease
$ cd ~/logsearch-boshrelease
Since I've changed directories away from our environment, `cloque` will no longer know where to find its environment
information. To help, I'll use a `.env` file...
$ ( \
echo 'export CLOQUE_BASEDIR=~/cloque-acme-dev' \
; echo 'export CLOQUE_DIRECTOR=aws-usw2' \
; echo 'export CLOQUE_DEPLOYMENT=logsearch' \
) > .env
> I mentioned before that `cloque` uses the current working directory, environment variables, and command options to
> figure out where to look for things. If it's still missing information, it will check and load a `.env` file from
> the current directory as a last resort. This is normally only useful during development where I already use `.env`
> for other project-specific BASH `alias`es and variables.
Now I can upload the release...
$ cloque boshdirector:releases:put releases/logsearch-latest.yml
> Since releases are Director-specific and unrelated to a particular deployment, the command uses the `boshdirector:*`
> namespace.
The example has the configuration files for infrastructure (EIP and security groups) and BOSH (deploy manifest), but
I still need to generate a certificate locally...
$ openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
-keyout ~/cloque-acme-dev/aws-usw2/ssl.key \
-out ~/cloque-acme-dev/aws-usw2/ssl.crt
> Having a directory per deployment helps keep everything scoped and organized when there are additional artifacts.
> The templating nature of `cloque` allows the files to be embedded into its own deployment manifest, but also other
> deployment manifests. With the example of logsearch, this means I don't need to copy and paste the `ssl.crt` into
> other deployments, just embed it using a relative path (embeds are always relative to the config file - something
> BOSH ERBs struggle with): `{% raw %}{{ env.embed('../logsearch/ssl.crt') }}{% endraw %}`.
Once uploaded, I can use the `infra:put` and the mirrored `bosh:put` commands to push the infrastructure and BOSH
deployment (`-n` meaning non-interactive, just like with `bosh`)...
$ cloque infra:put
...snip...
> waiting...CREATE_IN_PROGRESS.....................CREATE_COMPLETE...done
$ cloque -n bosh:put
Getting deployment properties from director...
...snip...
Deployed `bosh.yml' to `acme-dev-aws-usw2'
Once complete, I can see the [elasticsearch][19] service running...
$ wget -qO- '10.101.17.26'
{
"status" : 200,
"name" : "elasticsearch/0",
"version" : {
"number" : "1.2.1",
"build_hash" : "6c95b759f9e7ef0f8e17f77d850da43ce8a4b364",
"build_timestamp" : "2014-06-03T15:02:52Z",
"build_snapshot" : false,
"lucene_version" : "4.8"
},
"tagline" : "You Know, for Search"
}
And I can see the ingestor listening on its EIP:
$ echo 'QUIT' | openssl s_client -showcerts -connect $( cloque infra:get '.Z0IngestorEipId' ):5614
CONNECTED(00000003)
And I can SSH into the instance...
$ cloque bosh:ssh
...snip...
bosh_j51114xze@c989cf2f-91e4-407e-a7d7-bdc03ef79511:~$
> The `bosh:ssh` command is a little more intelligent than `bosh ssh`. It will peek at the manifest to know if there's
> only a single job running, in which case the job/index argument is unnecessary. Additionally, it will always
> use a default `sudo` password of `c1oudc0w` (avoiding the interactive delay and prompt that `bosh ssh` requires).
## Package Development
When creating a new package, I started using a convention of adding a comment with the origin URL where I found each
blob/file. This provides me with more of an audit trail over time, but it also allows me to automatically turn a `spec`
file which looks like:
---
name: "nginx"
files:
# http://nginx.org/download/nginx-1.7.2.tar.gz
- "nginx-blobs/nginx-1.7.2.tar.gz"
# ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.35.tar.gz
- "nginx-blobs/pcre-8.35.tar.gz"
# https://www.openssl.org/source/openssl-1.0.1h.tar.gz
- "nginx-blobs/openssl-1.0.1h.tar.gz"
...snip...
Into a series of `wget`s with the `boshutil:package-downloads` command...
$ cloque boshutil:package-downloads nginx
mkdir -p 'blobs/nginx-blobs'
[ -f 'blobs/nginx-blobs/nginx-1.7.2.tar.gz' ] || wget -O 'blobs/nginx-blobs/nginx-1.7.2.tar.gz' 'http://nginx.org/download/nginx-1.7.2.tar.gz'
[ -f 'blobs/nginx-blobs/pcre-8.35.tar.gz' ] || wget -O 'blobs/nginx-blobs/pcre-8.35.tar.gz' 'ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.35.tar.gz'
[ -f 'blobs/nginx-blobs/openssl-1.0.1h.tar.gz' ] || wget -O 'blobs/nginx-blobs/openssl-1.0.1h.tar.gz' 'https://www.openssl.org/source/openssl-1.0.1h.tar.gz'
...snip...
> I was tired of having to manually download files, `bosh add blob` them with the correct parameters and then having
> to manually delete the originals. This lets me completely avoid that step and ensures I'm using the files I expect.
> Whenever a blob is an internal file or `src`, I just take care of it manually like before.
When I'm working on a `packaging` script I use [Docker][22] images to emulate the build environment. Since 99% of my
build issues come from `configure` arguments and environment variables, this is normally sufficient. This also lets me
iteratively debug my packaging scripts as opposed to the slow, guess and check method of re-releasing and deploying the
whole thing to BOSH to test fixes. The `boshutil:package-docker-build` command helps me here...
$ cloque boshutil:package-docker-build ubuntu:trusty nginx
> compile/packaging...done
> compile/nginx-blobs/nginx-1.7.2.tar.gz...done
> compile/nginx-blobs/pcre-8.35.tar.gz...done
> compile/nginx-blobs/openssl-1.0.1h.tar.gz...done
...snip...
Sending build context to Docker daemon 7.571 MB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:trusty
---> ba5877dc9bec
Step 1 : RUN apt-get update && apt-get -y install build-essential cmake m4 unzip wget
...snip...
root@347c1d4ca07b:/var/vcap/data/compile/nginx#
> This command mirrors the BOSH environment by using the `spec` file to add the referenced blobs, uploads the
> packaging script, configures the `BOSH_COMPILE_TARGET` and `BOSH_INSTALL_TARGET` variables, creates the directories,
> and switches to the compile directory, ready for me to type `./packaging` or paste commands iteratively. It also has
> the `--import-package` and `--export-package` options to import/dump the resulting `/var/vcap/packages/{name}`
> directory to support dependencies.
## Snapshots
One easy feature that BOSH has is snapshotting to get a full backup of its persistent disks. You can run its `take
snapshot` command for a particular job or for an entire deployment. Or, if "dirty" snapshots are okay, the Director can
schedule them automatically. To manage all those snapshots, I created a few commands. The first command takes care of
snapshots that the BOSH Director creates of itself...
$ cloque boshdirector:snapshots:cleanup-self 3d
snap-4219f4fb -> 2014-09-13T06:01:14+00:00 -> deleted
snap-2e6588e4 -> 2014-09-13T06:03:55+00:00 -> deleted
snap-1acd90d3 -> 2014-09-13T06:06:36+00:00 -> deleted
snap-618c7da9 -> 2014-09-14T06:01:15+00:00 -> retained
snap-dce22315 -> 2014-09-14T06:03:55+00:00 -> retained
snap-a9e81a60 -> 2014-09-14T06:06:35+00:00 -> retained
snap-d35ea51a -> 2014-09-15T06:01:18+00:00 -> retained
snap-3742b88e -> 2014-09-15T06:03:58+00:00 -> retained
snap-0b8b40c2 -> 2014-09-15T06:06:38+00:00 -> retained
snap-ea16dfd3 -> 2014-09-16T06:01:18+00:00 -> retained
snap-913df459 -> 2014-09-16T06:03:58+00:00 -> retained
snap-82d5fc4b -> 2014-09-16T06:06:38+00:00 -> retained
> This command is simplistic and trims all snapshots older than a given period (in this case three days). I got very
> tired of (and forgetful about) regularly cleaning up snapshots from the AWS Console. It communicates directly with the
> AWS API since the `bosh` command doesn't seem to enumerate them.
The command for individual deployment snapshots is a bit more intelligent. It allows writing logic which, when passed a
given snapshot, determines whether it should be retained or deleted. For example...
$ cloque boshdirector:snapshots:cleanup
...snip...
snap-7837f7d4 -> 2014-08-01T07:01:30+00:00 -> dirty -> retained
snap-62cca4de -> 2014-08-04T07:00:28+00:00 -> dirty -> retained
snap-bdd29512 -> 2014-08-04T22:51:57+00:00 -> clean -> retained
snap-4dd5a3e1 -> 2014-08-04T23:46:23+00:00 -> clean -> retained
snap-2bb7c784 -> 2014-08-11T07:00:46+00:00 -> dirty -> retained
snap-5239b7fc -> 2014-08-18T07:00:40+00:00 -> dirty -> retained
snap-cf6fcb6e -> 2014-08-25T07:00:39+00:00 -> dirty -> retained
snap-9d00103c -> 2014-08-28T13:34:39+00:00 -> clean -> retained
snap-9d80103d -> 2014-09-01T07:00:43+00:00 -> dirty -> retained
snap-79c18cda -> 2014-09-08T07:00:44+00:00 -> dirty -> retained
snap-87f47a24 -> 2014-09-09T07:00:57+00:00 -> dirty -> deleted
snap-5fec87fc -> 2014-09-10T07:00:55+00:00 -> dirty -> retained
snap-bdfeda1e -> 2014-09-11T07:00:58+00:00 -> dirty -> retained
snap-246b6987 -> 2014-09-12T07:00:54+00:00 -> dirty -> retained
snap-c234d870 -> 2014-09-13T07:00:43+00:00 -> dirty -> retained
snap-28ed128a -> 2014-09-14T07:00:55+00:00 -> dirty -> retained
snap-ef6ac34d -> 2014-09-15T07:00:55+00:00 -> dirty -> retained
snap-72c156d3 -> 2014-09-16T07:00:42+00:00 -> dirty -> retained
> The command looks for a deployment-specific file which receives information about the snapshot (ID, date,
> clean/dirty) and returns `true` to cleanup/delete or `false` to retain. This allows me to create some very custom
> retention policies for individual deployments, depending on their requirements. In this example, clean snapshots are
> kept 3 months, Mondays are kept for 6 months, first of month is kept indefinitely, everything else kept for 1 week.
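That particular retention policy boils down to logic along these lines - an illustrative JavaScript sketch of the
decision only; the real hook is a deployment-specific script with its own inputs and format:
{% highlight javascript %}
// return true to delete the snapshot, false to retain it
function shouldDelete(snapshot, now) {
  var ageDays = (now - snapshot.date) / 86400000;

  if (snapshot.date.getUTCDate() === 1) return false;          // first of month: keep indefinitely
  if (snapshot.date.getUTCDay() === 1)  return ageDays > 183;  // Mondays: keep ~6 months
  if (snapshot.clean)                   return ageDays > 92;   // clean snapshots: keep ~3 months
  return ageDays > 7;                                          // everything else: keep 1 week
}

shouldDelete(
  { id : 'snap-87f47a24', date : new Date('2014-09-09T07:00:57Z'), clean : false },
  new Date('2014-09-16T12:00:00Z')
);
// -> true (the week-old dirty snapshot deleted in the output above)
{% endhighlight %}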
## Revitalizing
In the past I've typically used local VMs with [VirtualBox][23] or [VMWare Fusion][24] for personal development.
Unfortunately they always seemed to drift from production servers, which made things inconvenient, at best. With BOSH,
it became trivial for me to start/stop deployments and guarantee they have a known environment. When my VMs were local
I always had scripts which would pull down backups, restore them, and clean up data for development. With `cloque` I've
been using a `revitalize` concept which allows me to restore data from snapshots or run arbitrary commands. For
example, I can add the following to my database job to restore data from a slave's most recent snapshot...
jobs:
- name: "mysql"
...snip...
cloque.revitalize:
- method: "snapshot_copy"
director: "example-acme-aws-usw2"
deployment: "wordpress-demo-hotcopy"
job: "mysql"
- method: "script"
script: "{{ env.embed('revitalize.sh') }}"
> The `snapshot_copy` method takes care of finding the most recent snapshot with the given parameters and would copy
> the data onto the local `/var/vcap/store` directory (trashing anything it replaces). The `script` method allows an
> arbitrary script to run, in this case, one that resets the MySQL users/passwords and cleans data for development
> purposes.
Whenever I want to reload my dev deployment with more recent production data (or after I've sufficiently polluted my
dev data), I can just run the `bosh:revitalize` task...
$ cloque bosh:revitalize
> mysql/0
> finding 10.101.17.41...
> instance-id -> i-fe0e23f3
> availability-zone -> us-west-2w
> stopping services...
> waiting...............done
> snapshot_copy
> finding snapshot...
> snapshot-id -> snap-3867159a
> start-time -> 2014-09-16T06:58:31.000Z
> creating volume...
> volume-id -> vol-edc5bfe9
> waiting...creating...available...done
> attaching volume...
> waiting...in-use...done
> mounting volume...
> transferring data...
> removing mysql...done
> restoring mysql...done
> unmounting volume...
> detaching volume...
> waiting...in-use......available...done
> destroying volume...
> script...
> starting services...
...snip...
> This also makes it easy for me to condense services which run on multiple machines in production onto a single
> machine for development by restoring from multiple snapshots (as long as the services' `store` directories are
> properly named).
## Configuration Transformations
I mentioned earlier that configuration files are templates. In addition to basic templating capabilities, I added some
transformation options. Transformations allow a processor to receive the current state of the configuration, do some
magic to it, and return a new configuration. The easiest example of this is with logging - I want to centralize all my
log messages and [`collectd`][26] measurements. Here I'll use [logsearch-shipper-boshrelease][25], but regardless of
how it's done, it typically requires adding a new release to your deployment, adding the job template to every job, and
adding the correct properties. When you have multiple deployments, this becomes tedious and this is where a
transformation shines. The transform could take care of the following:
* adding the `logsearch` properties (SSL key, `bosh_director` field to messages, EIP lookup for the ingestor)
* add the `logsearch-shipper` release to the deployment
* add the `logsearch-shipper` job template to every job
And the raw code for that transform could go in `aws-usw2/logsearch/shipper-transform.php`:
<?php return function ($config, array $options, array $params) {
// add our required properties
$config['properties']['logsearch'] = [
'logs' => [
'_defaults' => implode("\n", [
'---',
'files:',
' "**/*.log":',
' fields:',
' type: "unknown"',
' bosh_director: "' . $params['network_name'] . '-' . $params['director_name'] . '"',
]),
'server' => $params['env']['self/infrastructure/logsearch']['Z0IngestorEipId'] . ':5614',
'ssl_ca_certificate' => $params['env']->embed(__DIR__ . '/ssl.crt'),
],
'metrics' => [
'frequency' => 60,
],
];
// add the template job to all jobs
foreach ($config['jobs'] as &$job) {
$job['templates'][] = [
'release' => 'logsearch-shipper',
'name' => 'logsearch-shipper',
];
}
// add the release, if it's not explicitly using a version
if (!in_array('logsearch-shipper', array_map(function ($a) { return $a['name']; }, $config['releases']))) {
$config['releases'][] = [
'name' => 'logsearch-shipper',
'version' => '1',
];
}
return $config;
};
And then whenever I want a deployment to forward its logs with `logsearch-shipper`, I only need to add the following to
the root level of my `bosh.yml` deployment manifest...
_transformers:
- path: "../logsearch/shipper-transform.php"
> This approach helps me keep my deployment manifests concise. Rather than clutter up my definitions with ancillary
> configuration and sidekick jobs, they remain focused on the services they're actually providing.
## Tagging
Since starting with BOSH, I've used AWS tags more heavily. I consistently use the `director` tag to represent the
`{network_name}-{region_name}` (e.g. `acme-dev-aws-usw2`) and the `deployment` tag to represent the logical set of
services (regardless of whether BOSH is managing them or not). I made another command which can enumerate relevant
resources and ensure they have the expected tags:
$ cloque utility:tag-resources
> reviewing us-west-2...
> acme-dev-aws-usw2/bosh/microbosh -> i-298fb0c6
> /dev/xvda -> vol-d46fa79b
> adding director -> acme-dev-aws-usw2
> adding deployment -> microbosh
> adding Name -> microbosh/0/xvda
> /dev/sdb -> vol-8b6c46c6
> adding director -> acme-dev-aws-usw2
> adding deployment -> microbosh
> adding Name -> microbosh/0/sdb
> /dev/sdf -> vol-8a6d46c6
> adding director -> acme-dev-aws-usw2
> adding deployment -> microbosh
> adding Name -> microbosh/0/sdf
> acme-dev-aws-usw2/logsearch/main/0 -> i-46be80b9
> /dev/sda -> vol-fa4e57b5
> adding director -> acme-dev-aws-usw2
> adding deployment -> logsearch
> adding Name -> main/0/sda
> /dev/sdf -> vol-73e0ce3e
> acme-dev-aws-usw2/infra/core/z1/gateway -> i-8d60f6a2
> /dev/sda1 -> vol-7b5b7838
> I added this command because I wanted to be sure my volumes were all accurately tagged. This helps me when using the
> AWS Console, but it also provides more detail in the AWS Billing Reports when the `director` and `deployment` tags
> are included for detailed billing.
## Conclusion
BOSH is far from perfect, in my mind, but with a little help it is enabling me to be more productive and effective
than other tools I've tried in the areas which are most important to me.
[1]: http://docs.cloudfoundry.org/bosh/
[2]: https://github.com/logsearch/logsearch-boshrelease
[3]: https://github.com/dpb587/cloque
[4]: http://aws.amazon.com/
[5]: http://aws.amazon.com/cloudformation/
[6]: http://www.terraform.io/
[7]: https://github.com/dpb587/cloque/blob/master/share/
[8]: http://openvpn.net/
[9]: https://github.com/dpb587/cloque/blob/master/share/local-core-infrastructure.yml
[10]: http://twig.sensiolabs.org/
[11]: http://console.aws.amazon.com/
[12]: http://stedolan.github.io/jq/
[13]: https://code.google.com/p/tunnelblick/
[14]: https://gist.githubusercontent.com/dpb587/c0427635b3316584e12e/raw/183ccda6c504fac02754b79b5a5b267848a70025/transfer-ami.sh
[15]: http://bosh_artifacts.cfapps.io/
[16]: https://github.com/cloudfoundry/bosh/tree/master/bosh_cli_plugin_micro
[18]: https://github.com/dpb587/cloque/blob/master/share/example-multi/network.yml
[19]: http://www.elasticsearch.org/
[20]: /blog/2014/02/28/distributed-docker-containers.html#the-alternatives
[21]: https://github.com/cloudfoundry-incubator/spiff
[22]: https://www.docker.com/
[23]: https://www.virtualbox.org/
[24]: http://www.vmware.com/products/fusion
[25]: https://github.com/logsearch/logsearch-shipper-boshrelease/
[26]: http://collectd.org/
[27]: https://github.com/dpb587/cloque/blob/master/share/example-multi/global/core/infrastructure.json
[28]: https://github.com/dpb587/cloque/blob/master/share/example-multi/aws-usw2/core/infrastructure.json

View File

@@ -1,10 +0,0 @@
---
title: "Colorado Aspens"
layout: "post"
tags: [ "aspen", "autumn", "colorado", "photo-gallery" ]
description: "A non-technical post with pictures of the changing Aspens in Colorado."
---
Colorado is usually a beautiful place, but especially in Autumn when the Aspens are turning&hellip;
{% include gallery_list640w.html gallery='2014-colorado-aspens' %}

View File

@@ -1,138 +0,0 @@
---
title: "Logging logging and Finding Bottlenecks"
layout: "post"
tags: [ "elasticsearch", "kibana", "logsearch", "logstash", "metrics", "queue", "regex", "slow logstash" ]
description: "Some ways logsearch is measuring its own performance with the elasticsearch+logstash+kibana stack."
primary_image: /blog/2014-11-14-logging-logging-and-finding-bottlenecks/parsed-messages.jpg
---
I've been doing quite a bit of work with the ELK stack ([elasticsearch][1], [logstash][2], [kibana][3]) through the
[logsearch][4] project. As we continued to scale the stack to handle more logs and log types, we started having
difficulty identifying where some of the bottlenecks were occurring. Our most noticeable issue was that occasionally the
load on our parsers would spike for sustained periods, causing our queue to get backed up and real-time processing to
get significantly delayed. We were able to see when our queue size was growing, but I needed to find better metrics
which would demonstrate our real issue.
<img alt="Screenshot: slow queue" src="{{ site.asset_prefix }}/blog/2014-11-14-logging-logging-and-finding-bottlenecks/slow-queue.jpg" width="628" />
## The Event Lifecycle
For non-trivial ELK stacks, there are typically a few services that a message hits between being a line in a log file
and a plotted point on a Kibana graph. With logsearch, and logstash in general, those services are:
0. The Shippers - are responsible for getting log messages into logsearch (e.g. tailing log files with [nxlog][6]) by
pushing them to...
0. The Ingestors - which listen for those messages on various ports for various protocols (e.g. syslog). Rather than
trying to immediately parse messages and be a bottleneck, it pushes messages into...
0. The Queue - which helps buffer against degraded performance from large spikes. In logsearch, this is [redis][5].
For real-time processing, the queue is typically empty because the messages should immediately be pulled by...
0. The Parsers - which are responsible for parsing/extracting/transforming the log messages into something searchable.
Typically, there are numerous parser rules for the various types of log files. Once parsed, they get pushed to...
0. The Data Store - where the parsed message lives in elasticsearch for the rest of its life, searchable by tools like
Kibana.
In our situation, we could see that the parsers were becoming the bottleneck. Despite relatively consistent logging
rates, the CPU loads would max out and messages were reaching elasticsearch at very slow rates. As a short-term fix,
we could easily start up several more parsers which helped a little bit, but this required manual intervention and
wasn't actually fixing the problem.
## Areas to Profile
Logstash itself has a `--debug` option which will dump details about every input, filter, and output each event
hits. This is helpful when testing individual events, but in a production environment with thousands of events per
minute it just became too noisy to be useful. We needed a different solution.
Typically, when all is said and done, we only have one timestamp to look at: `@timestamp` as extracted from the log
message and indicating when the log message was originally emitted. However, when the bottlenecks were occurring, there
was up to an hour of delay before messages appeared in dashboards, and we had no way to measure how long messages were
stuck nor see where they were stuck. We decided to inject a few more fields into events...
First, we wanted to know when log messages were first entering our logsearch stack. This would help validate that our
shippers are pushing data into the cluster in a timely manner (rather than significant batching or simply getting
stuck). To do this, I configured ingestors to add the current time to every message when it came in. I also added
fields documenting which BOSH job received the message to help us keep an eye on how balanced the ingestors may be.
So, now our messages have a few additional fields...
* `@ingestor[timestamp]` - the time the ingestor saw the event (e.g. `2014-11-14T12:02:36.181Z`)
* `@ingestor[job]` - the job which ingested the event (e.g. `ingestor/1`)
* `@ingestor[service]` - which logsearch job template received the message (e.g. `syslog`)
The next step in the lifecycle was the queue. The easiest way to monitor how long a message stayed in the queue is to
add another timestamp right when the parser shifts the message off the queue. Since we have multiple parsers running, I
configured them to also add their BOSH job name as a field. With the working theory that some of our parser rules were
especially inefficient, I also added a final timestamp at the very end of the parsing rules. This would let us compare
start/end parser timestamps. Now messages have a few more fields...
* `@parser[timestamp]` - the time the parser saw the event (e.g. `2014-11-14T12:02:36.450Z`)
* `@parser[job]` - the job which parsed the event (e.g `parser-z1/3`)
* `@parser[timestamp_done]` - the time when the parser finished parsing the event (e.g. `2014-11-14T12:02:36.462Z`)
With those 6 new fields, the event now has some very valuable metadata that we can review. However, the information
would be much more valuable if we could easily aggregate and graph individual events. So I added a bit more
overhead with math and graphable fields...
* `@parser[duration]` - instead of `timestamp_done`, switch to the duration the parser took (e.g. `12`)
* `@timer[ingested_to_parsed]` - essentially the time our logsearch stack spent on the event from when we first
saw it to (roughly) when the end user should be able to search it (e.g. `281`)
* `@timer[emit_to_ingested]`, `@timer[emit_to_parsed]` - if the conventional `@timestamp` field is parsed out of the
log message, we can use that as an absolute starting point and get further insight into how slow shippers are to
send the message (e.g. `301`, `582`)
## Graphing Bottlenecks
After deploying the changes we were able to make some new Kibana dashboards to help visualize all our new metrics.
Since parsers seemed to be the bottleneck, we first wanted to monitor how many messages the jobs were actually parsing
at a given time...
<img alt="Screenshot: parsed messages" src="{{ site.asset_prefix }}/blog/2014-11-14-logging-logging-and-finding-bottlenecks/parsed-messages.jpg" width="628" />
During light loads where everything would be processing in real-time, we expected it to fully mirror our other chart
measuring the rates at which we were receiving the messages...
<img alt="Screenshot: ingested messages" src="{{ site.asset_prefix }}/blog/2014-11-14-logging-logging-and-finding-bottlenecks/ingested-messages.jpg" width="628" />
Historically our spikes seemed random, so we started segmenting the average parse times by log types under the theory
that some particular log was sending confusing messages. Our average time was around 10 ms, but after splitting by type
we saw one log type was averaging more than one second (per message)...
<img alt="Screenshot: parsing duration before" src="{{ site.asset_prefix }}/blog/2014-11-14-logging-logging-and-finding-bottlenecks/parsing-duration-before.jpg" width="628" />
Clearly this would cause all of our parsing to slow down whenever that log suddenly saw a lot of activity. Now that we
could find slow log messages, we were able to use them to track down some extremely non-performant regular expressions
in one of our `grok` filters. After deploying the updated filters, we started seeing *much* more consistent parsing
results among all our log types...
<img alt="Screenshot: parsing duration after" src="{{ site.asset_prefix }}/blog/2014-11-14-logging-logging-and-finding-bottlenecks/parsing-duration-after.jpg" width="628" />
## Conclusion
I learned a few things from all this. Most notable is how invaluable it is to be able to inject profiling into various
steps of an otherwise unmeasured lifecycle. Obviously this adds a bit of processing and storage overhead into the
stack, but since we haven't noticed a large impact in our day-to-day usage we've kept the extra profiling enabled.
Although we have yet to experience another incident of a poorly performing parser, we're ready with metrics when we do.
In the meantime, we use it to more easily monitor the practical capacity of our logstash components.
This became a great example of how a relatively minor bug can be compounded and multiplied into bigger issues.
A single log message taking 2 seconds isn't a big deal, even when you have 1000 other log messages/sec coming in - at
worst you briefly lag by a couple seconds. If you have 10 parsers running it isn't even noticeable because the other 9
parsers pick up the slack. But if all of a sudden you get 100 log messages hitting the slow bug, those 10 parsers will
each spend 20 seconds working through those slow messages and, once they finish those 100, there will be 20,000
messages waiting in the queue.
Whether it's the [dashboards][7] we use to self-monitor, the [filters][8] we build app-specific parsers off of, or
this new [profiling configuration][9] that we were motivated to work on -- I enjoy being in a role where these
experiences can be codified, committed, and published in an open-source manner.
[1]: http://www.elasticsearch.org/overview/elasticsearch/
[2]: http://www.elasticsearch.org/overview/logstash/
[3]: http://www.elasticsearch.org/overview/kibana/
[4]: https://github.com/logsearch/logsearch-boshrelease
[5]: http://redis.io/
[6]: http://nxlog-ce.sourceforge.net/
[7]: https://github.com/logsearch/logsearch-boshrelease/tree/develop/share/kibana-dashboards
[8]: https://github.com/logsearch/?query=logsearch-filters
[9]: https://github.com/logsearch/logsearch-boshrelease/pull/79/commits

View File

@@ -1,92 +0,0 @@
---
title: "Sending Work from a Web Application to Desktop Applications"
layout: "post"
tags: [ "applescript", "automation", "aws-sqs", "box", "dymo", "endicia", "hazel", "launchd", "osx", "phar", "php", "usps" ]
description: "Using queues and PHP to automate third-party applications running on staff workstations."
code: https://github.com/theloopyewe/elfbot
---
I prefer working on the web application side of things, but there are frequently tasks that need to be automated outside the context of a browser and server. For [TLE][10], there's a physical shop where inventory, order, and shipping tasks need to happen, and those tasks revolve around web-based systems of one form or another. To help unify and simplify things for the staff (aka [elves][11]), I've been connecting scripts on the workstations with internal web applications via queues in the cloud.
## Evolution of a bot
Over the past 8+ years, the need for running commands on the desktop has changed. The easiest example to follow is how we have printed shipping labels over the years:
0. For the first few months, we would copy/paste the address into the [USPS Print & Ship][1] website, click through the shipping options ourselves, print out a label on sticky paper with inkjet, and copy/paste back the delivery confirmation into an order note. Averaging a few orders a day, it was quite manageable.
0. With more orders we needed something faster, so I created a form posting to USPS which prefilled all the fields. This way, all we needed to do was confirm/print and copy/paste the delivery confirmation back. That helped for a bit longer.
0. With a growing number of orders, we still needed something more, so we switched to [Endicia][2], a desktop application which had several integration options and the ability to print directly to a label printer. I switched from USPS links/forms to pre-composed links using Endicia's [custom URI handler][3]. This helped speed things up, save money on label paper, and also automatically copied confirmation codes for us to paste.
0. Occasionally we would have a couple problems with the URI approach, so I changed to using file downloads:
1. Instead of Endicia's custom URI handler, the server would send a file download with the XML-based postage details.
1. Using the [watched folder approach][4], OS X would notice the new file and send it to Endicia for printing.
0. This worked fairly well, but we quickly ran into a few quirks related to AppleScript's watched folder features and browser downloads - some files not being noticed at all or being noticed multiple times. We switched to [Hazel][5] which not only sidestepped the bugs we were seeing, but also provided me with better insight if something failed.
0. A bit later I discovered the `OutputFile` attribute of the [DAZzle spec][6] which would allow me to capture the results of the printed postage. By using and monitoring a different file extension for the output, I updated the script to parse the results and post the confirmation code to the website. This became an immense timesaver since it would allow postage to be queued instead of having to wait to paste each confirmation code manually. We used this approach for a long time.
Eventually we needed to do more than just printing postage. The Hazel setup was straightforward, but the AppleScript implementation had become a bit too complex and inconvenient to test and change. We also needed this setup to be easily deployed on multiple systems. At this point I decided to spend some time coming up with a different solution which would better meet our needs.
## The Bot
Today's bot operates a bit differently. Rather than depending on monitored folders for file downloads, each workstation has its own queue (via [Amazon SQS][8]). Rather than complex logic in AppleScript, it is primarily based in PHP (as a [Phar][13]). Rather than Hazel managing processes, [launchd][9] typically runs it as an agent daemon. Rather than only printing shipping labels, it helps with several different tasks. Here are some of them...
**Printing Postage** - the long-lived task of printing postage. The server pushes a resource URL which has the DAZzle XML data with address/contents/weights, the task gets the resource and sends it to Endicia, and then, once finished, it pushes the results back to the server where shipment costs and confirmation codes get extracted to update the order.
**Purchasing Postage** - Endicia uses an account balance when printing postage, so whenever it gets low we need to reload it. Typically this requires user intervention since they don't support automatic reloading, but this task runs through the menus and dialogs with AppleScript ([discussed here][21]) to avoid any real interruptions. Whenever the system notices the balance getting low, it automatically sends this task to a capable workstation.
**Archiving the Mailing Log** - Endicia keeps track of the postage it prints/buys/refunds in a mailing log. Over time this grows and slows things down, so Endicia provides an option to archive the log. Normally this is a manual process, but this task automates it. In addition to archiving, it also takes care of uploading the log to an encrypted S3 bucket where a server process can later go through to reconcile the transactions. A scheduler regularly sends this task to workstations running Endicia.
**Label Printing** - another task we need to manage is printing labels for inventory through [DYMO Label][15]. The labels use a QR code ([discussed here][14]) and may include price and other product information. The server pushes a resource URL which has the XML-based label template appropriate for the product, embedding the product/inventory details. The task then downloads the label file to a temporary location and uses AppleScript to open it, printing however many copies are requested.
**Webcam** - in addition to the [virtual tour][16] of the shop, we also have a [public webcam][17]. The webcam software supports sending snapshots to a URL endpoint on a timed interval, but it doesn't support SSL/TLS connections. As a workaround, this task takes care of downloading the snapshot as JPG and then uploading it securely to the correct endpoint. A scheduler is responsible for pushing this task to a server at the shop during business hours.
**Printing** - a more recent experiment is for remotely printing regular documents. Sometimes the system sends emails to the staff when they need to reprint documents (such as pricing signs, pull details, or inventory locations). Rather than waiting for someone to see those emails and manually print them, I'm hoping the documents can just be waiting in the printer in the mornings for an elf to quickly pick up and handle.
**User Dialog** - sometimes there are one-off tasks which need interaction. For example, letting the user know if Endicia is having confirmed service issues where we need to wait on printing more shipping labels.
**Automatic Updates** - another more recent development is automatic updates. Historically I used read-only deployment keys and manually deployed the full repository to workstations. This was problematic on older machines since it needed `git`. Instead, I've started deploying Phars, creating them with [box][18] and publishing a versions manifest ([example][20]) for the [php-phar-update][19] component. Whenever it's convenient for the workstation, I can push the update task and let it self-update and restart.
## From the Web
From the server side of things, it maintains a hard-coded mapping of workstations and their available tasks. Whenever multiple workstations can handle a particular task, an extra field is presented to the user so they can pick where it should happen (defaulting to their own).
<img alt="Screenshot: Print To selection" src="{{ site.asset_prefix }}/blog/2015-02-21-sending-work-from-a-web-application-to-desktop-applications/print-to-interface.jpg" width="480" />
Whenever the app needs to send a task to a bot, it queues a JSON object where the key is the task name and its value is the task options. For example, the payload for purchasing new postage looks like:
{% highlight json %}
{ "endicia.purchase_postage": {
"amount": 500 } }
{% endhighlight %}
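The bot itself is PHP, but the queue side is plain SQS, so pushing a task by hand is nothing more exotic than a
`send-message` call. A rough sketch with the AWS CLI (the queue URL here is hypothetical)...

{% highlight console %}
# queue the purchase-postage task for a specific workstation
$ aws sqs send-message \
    --queue-url 'https://sqs.us-west-2.amazonaws.com/123456789012/elfbot-frontdesk' \
    --message-body '{"endicia.purchase_postage":{"amount":500}}'
{% endhighlight %}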
## Conclusion
PHP probably isn't most people's first thought for this sort of solution - there isn't any hypertext involved, after all. But since I didn't have to abuse PHP to fit here, and since it's a language I'm very productive with, it was the most efficient route to solving my problems. It has taken a few experiments to get to this point, but over the past ~2 years this queueing/PHP-based approach has been working out very well for us on the ~6 systems it runs on.
Although it probably doesn't make much sense for others, I recently cleaned up and open sourced the bot portion of the code that I've been using for this. The [elfbot][12] repository has most of the tasks, an example configuration, and a compiled Phar in the releases. Maybe you'll find something interesting.
[1]: https://www.usps.com/
[2]: http://www.endicia.com/
[3]: http://mac.endicia.com/extras/urls/
[4]: http://mac.endicia.com/extras/applescript/
[5]: http://www.noodlesoft.com/hazel.php
[6]: http://mac.endicia.com/extras/xml/
[8]: http://aws.amazon.com/sqs/
[9]: https://developer.apple.com/library/mac/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingLaunchdJobs.html
[10]: https://www.theloopyewe.com/
[11]: https://www.theloopyewe.com/sheri/2008/08/the-loopy-elves-in-the-loopy-limelight
[12]: https://github.com/theloopyewe/elfbot
[13]: http://php.net/manual/en/book.phar.php
[14]: /blog/2014/01/13/barcoding-inventory-with-qr-codes.html
[15]: http://www.dymo.com/en-US
[16]: https://www.theloopyewe.com/about/loopy-central/fort-collins
[17]: https://www.theloopyewe.com/about/loopy-central/webcam/
[18]: http://box-project.org/
[19]: https://github.com/herrera-io/php-phar-update
[20]: https://theloopyewe.github.io/elfbot/versions.json
[21]: /blog/2013/01/28/scripting-endicia-to-purchase-postage.html

View File

@@ -1,75 +0,0 @@
---
title: "Parsing Microdata in PHP"
layout: "post"
tags: [ "microdata", "opensource", "php", "schema", "xpath" ]
description: "Open sourcing a library to easily traverse HTML for microdata."
code: https://github.com/dpb587/microdata-dom.php
---
A couple years ago I wrote about how I was [adding microdata][3] to [The Loopy Ewe][1] website to annotate things like products, brands, and contact details. I later wrote about how the internal search engine [depended on that microdata][4] for search results. During development and the initial release I was using some basic [XPath][2] queries, but as time passed the implementation became more fragile and incomplete. Since then, the parser has gone through several refactorings and this week I was able to extract it into a separate library that I can [open source][9].
## Implementation
My original implementation was a single helper class with a confusing mix of recursion, loops, and values by reference. The helper would receive the HTML string to parse and it would return a complex array with self-referencing values for multi-level scopes. Looking for a more reliable data structure to pass around, I decided to switch and extend the [`DOMDocument`][5]. I spent some time reading the [HTML Microdata][6] spec and wanted to try and find a balance between the spec's [DOM API][7] and existing PHP conventions.
Now I use the library's [`MicrodataDOM\DOMDocument`][8] class when I want to parse a microdata document. It works just like the built-in `DOMDocument` so I'm able to manage libxml errors, control how I import the HTML document, and pass it through methods which are expecting a regular `DOMDocument`. The key difference is the addition of a `getItems` method which lets me quickly retrieve the microdata items. Internally, `getItems` and subsequent calls are still using XPath queries.
In addition to extending `DOMDocument`, the library also extends `DOMElement`. This way, `getItems` is just returning a regular (but still specialized) list of DOM elements. The extended element class provides access to the microdata attributes like type, property name, and value.
## Usage
It works like a low-level library, expecting other, more specialized classes to add their own friendlier methods on top. Here's the example I used in the readme...
{% highlight php %}
<?php
$dom = new MicrodataDOM\DOMDocument();
$dom->loadHTMLFile('http://dpb587.me/about.html');
// find Person types and get the first item
$dpb587 = $dom->getItems('http://schema.org/Person')->item(0);
echo $dpb587->itemId;
// items are still regular DOMElement objects
printf(" (from %s on line %s)\n", $dpb587->getNodePath(), $dpb587->getLineNo());
// there are a couple ways to access the first value of a named property
printf("givenName: %s\n", $dpb587->properties['givenName'][0]->itemValue);
printf("familyName: %s\n", $dpb587->properties['familyName']->getValues()[0]);
// or directly get the third, property-defining DOM element
$property = $dpb587->properties[3];
printf("%s: %s\n", $property->itemProp[0], $property->itemValue);
// use the toArray method to get a Microdata JSON structure
echo json_encode($dpb587->toArray(), JSON_UNESCAPED_SLASHES) . "\n";
{% endhighlight %}
Which will output something like...
http://dpb587.me/ (from /html/body/article/section on line 97)
givenName: Danny
familyName: Berger
jobTitle: Software Engineer
{"id":"http://dpb587.me/","type":["http://schema.org/Person"],"properties":{"givenName":["Danny"],...snip...}
In addition to using it for the internal search, I've been using this library for other internal tools responsible for sanitizing, normalizing, and taking care of some validation during development and testing. Hopefully I'll be able to extract and open-source those features sometime as well.
## Summary
Back when I first started this, I couldn't find any good libraries for this sort of microdata parsing. Nowadays it looks like there's at least [one other project][10] which I would consider if I didn't already have an implementation. With bias, I do still favor mine because of the unit tests, `itemprop` properties implementation, and a bit closer mirroring of how the spec describes interacting with a microdata API.
[1]: https://www.theloopyewe.com/
[2]: http://php.net/manual/en/class.domxpath.php
[3]: /blog/2013/05/13/structured-data-with-schema-org.html
[4]: /blog/2013/06/01/search-engine-based-on-structured-data.html
[5]: http://php.net/manual/en/class.domdocument.php
[6]: http://www.w3.org/TR/microdata/
[7]: http://www.w3.org/TR/microdata/#microdata-dom-api
[8]: https://github.com/dpb587/microdata-dom.php/blob/master/src/MicrodataDOM/DOMDocument.php
[9]: https://github.com/dpb587/microdata-dom.php
[10]: https://github.com/linclark/MicrodataPHP

View File

@@ -1,23 +0,0 @@
---
title: "New BOSH Release for OpenVPN"
layout: "post"
tags: [ "bosh", "openvpn" ]
description: "Open sourcing a new BOSH release for managing an OpenVPN network."
code: https://github.com/dpb587/openvpn-boshrelease
---
I'm a big fan of [OpenVPN][1] - both for personal and professional VPNs. Seeing as how I've been deploying more things with [BOSH][2] lately, an OpenVPN release seemed like a good little project. I started one about nine months ago and have been using development releases ever since, but last week I went ahead and created a ["final" release][6] of it.
There is only a single job (`openvpn`) and the properties are [well documented][3]. Its primary purpose is to act as a server for other clients to connect to; however, it can also be configured as a client connecting to another OpenVPN network. This makes it very easy to join multiple networks from a single OpenVPN connection.
One of the more complicated steps of configuring an OpenVPN server is figuring out and remembering the correct commands for creating and signing security keys and certificates. The [README][4] includes all those steps to get a server running in a deployment and a client connected to it. There are also a few other examples about some fancier configuration options such as: setting up `iptables` for shared networks, allowing VPN clients to communicate with each other, and making sure specific clients are assigned static IPs.
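The README is the authoritative walkthrough, but roughly speaking the key and certificate dance boils down to a few `openssl` commands like the following (a sketch with throwaway subject names - adjust key sizes, lifetimes, and extensions to your own needs)...

{% highlight console %}
# certificate authority (keep ca.key somewhere safe)
$ openssl req -new -x509 -nodes -newkey rsa:2048 -days 3650 \
    -subj '/CN=openvpn-ca' -keyout ca.key -out ca.crt

# server key and a certificate signed by the CA
$ openssl req -new -nodes -newkey rsa:2048 \
    -subj '/CN=openvpn-server' -keyout server.key -out server.csr
$ openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
    -days 3650 -out server.crt

# diffie-hellman parameters for the server
$ openssl dhparam -out dh.pem 2048
{% endhighlight %}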
After going through the process of setting up quite a few OpenVPN servers and trying to automate and maintain them, I now reach for this BOSH release given its flexibility, consistency, and handy readme, so I'm no longer Googling at every step. Check out the [project page][5] if you'd like to learn more, or see the [releases][6] page there for a tarball that you can use in your own BOSH environment.
[1]: https://openvpn.net/
[2]: http://bosh.io/
[3]: https://github.com/dpb587/openvpn-boshrelease/blob/89fd58982db3327e26cb8e2b9ed06835ffb08dd1/jobs/openvpn/spec#L17
[4]: https://github.com/dpb587/openvpn-boshrelease/blob/master/README.md
[5]: https://github.com/dpb587/openvpn-boshrelease
[6]: https://github.com/dpb587/openvpn-boshrelease/releases

View File

@@ -1,81 +0,0 @@
---
title: "Using nginx to Reverse Proxy and Cache S3 Objects"
layout: "post"
tags: [ "aws", "aws-s3", "caching", "nginx", "reverse-proxy", "s3", "upstream" ]
description: "Using S3 as an upstream server for improving long-tail traffic."
---
My most recent project for [TLE][1] has been focused on making the infrastructure much more "cloud-friendly" and resilient to failures. One step in the project was going to require that more than one version might be running at a given time (typically just while a new version is still being rolled out to servers). The application itself doesn't have an issue with that sort of transition period, however, the way we were handling static assets (like stylesheets, scripts, and images) was going to cause problems. First, some background...
When the frontend application code gets built and packaged up, it only contains the static assets for its own version. The static assets get dumped into `/docroot/static/{hash}/`, where the hash is generated based on when they were last modified and build runtime details. Once the application gets deployed and symlinked live, the old versions are no longer accessible from the document root. This obviously has implications like:
0. Late requests for those old assets result in 404s (infrequently from users, usually from bots).
0. Application servers must be reloaded onto the new version at the same time (otherwise, an old server without the new assets might be used by the proxy).
Additionally, we use [CloudFront][2] as a CDN for those static assets with our website configured as the origin. If the CDN gets back a 404 for an asset (old or new) it is cached for a short period and potentially affects a lot of clients (particularly bad if it happens on the upcoming, new version). Since CloudFront supports [S3][4] buckets as origins, I figured we could use it to store all the versions of our static assets. I quickly added a step to the deployment process which uploads new assets to a bucket. However, that was only part of the solution.
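(That upload step is nothing fancy - roughly the following during deployment, where the bucket name matches the nginx config further down and the local path is an assumption based on our build output.)

{% highlight console %}
# publish this build's hashed asset directory alongside all previous versions
$ aws s3 sync docroot/static/ s3://example-static-bucket/static/
{% endhighlight %}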
Unfortunately, CloudFront doesn't support dynamic [gzip][5] compression - it will only send back, byte-for-byte, what the origin delivers and we were storing the plain, non-gzipped versions in S3. The options were to...
0. no longer provide the files in gzip form (bad option... some files are genuinely large);
0. store both plain and gzip versions in separate S3 objects, then change the web application to dynamically rewrite the `link`/`script`/URLs based on browser headers (a lot of work, fragile, and bad use of existing web standards); or
0. continue using our website as the origin where responses could correctly be `Vary`'d and conditionally compressed.
The last one was definitely my preferred choice, but we would still have the problem of a single version being on the filesystem and unpredictable results when multiple application server versions were running behind the proxy. After some thought, I wanted to try using the S3 bucket as an upstream and avoiding the application servers altogether. And to improve latency and minimize external S3 requests, I could cache them locally. After some experimentation, I ended up with something like the following in our [nginx][3] configs...
location ~ ^(/static/.+)$ {
# we can only ever GET/HEAD these resources
limit_except GET {
deny all;
}
# cookies are useless on these static, public resources
proxy_ignore_headers set-cookie;
proxy_hide_header set-cookie;
proxy_set_header cookie "";
# avoid passing along amazon headers
# http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
proxy_hide_header x-amz-delete-marker;
proxy_hide_header x-amz-id-2;
proxy_hide_header x-amz-request-id;
proxy_hide_header x-amz-version-id;
# only rely on last-modified (which will never change)
proxy_hide_header etag;
# heavily cache results locally
proxy_cache staticcache;
proxy_cache_valid 200 28d;
proxy_cache_valid 403 24h;
proxy_cache_valid 404 24h;
# s3 replies with 403 if an object is inaccessible; essentially not found
proxy_intercept_errors on;
error_page 403 =404 /_error/http-404.html;
# go get it from s3
proxy_pass https://s3-us-west-1.amazonaws.com/example-static-bucket$1;
# annotate response about when it was originally retrieved
add_header x-cache '$upstream_cache_status $upstream_http_date';
# heavily cache results downstream
expires max;
}
So, with the above configuration...
* CloudFront still points to our website and we can serve gzip/plain at the same resource;
* assets are kept around indefinitely (and we could utilize bucket lifecycle policies if it becomes an issue);
* frontend web server no longer relies on a particular application server's filesystem;
* access to the S3 bucket/prefix can be restricted via bucket policy; and
* most importantly... deployment timing is no longer critical - versions can be deployed at whatever pace is appropriate and possible.
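A quick way to sanity-check the behavior from the outside is to look at the headers the proxy adds (the hostname and asset path here are made up)...

{% highlight console %}
# x-cache shows HIT/MISS plus the date S3 originally served the object;
# expires/cache-control confirm the aggressive downstream caching
$ curl -s -o /dev/null -D- -H 'Accept-Encoding: gzip' \
    'https://www.example.com/static/5f2b81c3/site.css' \
    | grep -iE '^(x-cache|expires|cache-control|content-encoding):'
{% endhighlight %}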
Since deploying these changes over a month ago, everything has been working very well and the number of static 404 nuisances in our error logs has dropped significantly. It also made it much easier to move onto the next problem on the path to cloud-friendliness and resiliency...
[1]: https://www.theloopyewe.com/
[2]: http://aws.amazon.com/cloudfront/
[3]: http://nginx.org/
[4]: http://aws.amazon.com/s3/
[5]: https://en.wikipedia.org/wiki/Gzip

View File

@@ -1,137 +0,0 @@
---
title: "Self-Upgrading Packages in BOSH Releases"
layout: "post"
keywords: [ "bosh", "package manager", "updates", "upgrades", "versions" ]
description: "A strategy for monitoring upstream dependencies for self-sustaining packages."
---
Outside of the [BOSH][1] world, package management is often handled by tools like [yum][2] and [apt][3]. With those tools, you're able to run trivial commands like `yum info apache2` to check the available versions or `yum update apache2` to upgrade to the latest version. It's even possible to automatically apply updates via cron job. With BOSH, it's not nearly so easy since you must monitor upstream releases and manually download the sources before moving on to testing and deploying. Personally, this repetitive sort of maintenance is one of my least favorite tasks, so to avoid it I started automating.
## Automating
There are two critical steps involved with this sort of thing. First is being able to `check` when new versions are available. For this post, I'll use my [OpenVPN BOSH Release][9] which has a single package with three dependencies. For each dependency, I can use commands to check for the latest version...
# lzo
$ wget -q -O- http://www.oberhumer.com/opensource/lzo/download/ | grep -E 'href="lzo-[^"]+.tar.gz"' | sed -E 's/^.+href="lzo-([^"]+).tar.gz".+$/\1/' | gsort -rV | head -n1
2.09
# openssl
$ git ls-remote --tags https://github.com/openssl/openssl.git | cut -f2 | grep -Ev '\^{}' | grep -E '^refs/tags/OpenSSL_.+$' | sed -E 's/^refs\/tags\/OpenSSL_(.+)$/\1/' | tr '_' '.' | grep -E '^\d+\.\d+\.\d+\w*$' | gsort -rV | head -n1
1.0.2d
# openvpn
$ git ls-remote --tags https://github.com/OpenVPN/openvpn.git | cut -f2 | grep -Ev '\^{}' | grep -E '^refs/tags/v.+$' | sed -E 's/^refs\/tags\/v(.+)$/\1/' | tr '_' '.' | grep -E '^\d+\.\d+\.\d+$' | gsort -rV | head -n1
2.3.7
The second step is being able to `get` (download) the source for a given version. The location to download a dependency's source is typically predictable, once the pattern is known...
$ wget -O lzo.tar.gz "http://www.oberhumer.com/opensource/lzo/download/lzo-${VERSION}.tar.gz"
Within the release, files become structured like:
./blobs/openvpn-blobs/
    ./lzo/
        lzo.tar.gz
    ./openssl/
        openssl.tar.gz
    ./openvpn/
        openvpn.tar.gz
./packages/openvpn/
    ./deps/
        ./lzo/
            ./check
            ./get
            ./VERSION
        ./openssl/
            ./check
            ./get
            ./VERSION
        ./openvpn/
            ./check
            ./get
            ./VERSION
    ./packaging
    ./spec
Each dependency has its own blob directory, allowing old versions to be fully removed before replacing it with the new version's file(s). Inside the package directory, `VERSION` is a committed state file used for comparison in version checks. It can also be used to quickly reference and document what versions are being used...
$ find packages -name VERSION | xargs -I {} -- /bin/bash -c 'A={} ; printf "%12s %s/%s\n" $( cat $A ) $( basename $( dirname $( dirname $( dirname $A ) ) ) ) $( basename $( dirname $A ))'
2.09 openvpn/lzo
1.0.2d openvpn/openssl
2.3.7 openvpn/openvpn
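The real scripts are in the gist linked at the end of this post, but to give a feel for their shape, the lzo pair is essentially the one-liners from above wrapped into two tiny executables (`sort -rV` here is GNU sort - `gsort` on OS X, as above - and the argument-based interface is just my own convention)...

{% highlight bash %}
#!/bin/bash
# packages/openvpn/deps/lzo/check - print the latest upstream version
set -eu
wget -q -O- http://www.oberhumer.com/opensource/lzo/download/ \
  | grep -E 'href="lzo-[^"]+.tar.gz"' \
  | sed -E 's/^.+href="lzo-([^"]+).tar.gz".+$/\1/' \
  | sort -rV \
  | head -n1
{% endhighlight %}

{% highlight bash %}
#!/bin/bash
# packages/openvpn/deps/lzo/get - download a given version as the new blob
set -eu
VERSION="$1"
wget -q -O lzo.tar.gz "http://www.oberhumer.com/opensource/lzo/download/lzo-${VERSION}.tar.gz"
{% endhighlight %}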
One side effect of this structure is that the `packaging` script and `spec` manifest should be version agnostic. Otherwise you still end up needing to tweak them every time a version changes, defeating the automation. In `packaging`, references such as `openssl-1.0.2d` would typically become `openssl-*`. In `spec`, the `files` property is minimal...
---
name: "openvpn"
files:
- "openvpn-blobs/**/*"
When it comes time to upgrade dependencies I can run a [utility script][5]...
$ ./bin/deps-upgrade-auto
==> openvpn/lzo
--| local 2.09
--| check 2.09
==> openvpn/openssl
--| local 1.0.1m
--| check 1.0.2d
--> fetching new version
--> 5.1M
==> openvpn/openvpn
--| local 2.3.6
--| check 2.3.7
--> fetching new version
--> 1.1M
The script runs through all the dependencies, uploads new blobs to the blobstore, and commits the changes with a nice summary...
$ git log --format=%B -n1
Upgraded 2 package dependencies
openvpn
* openssl now 1.0.2d (was 1.0.1m)
* openvpn now 2.3.7 (was 2.3.6)
At this point, I have a single command that I can run to check and upgrade dependencies in all my packages. This openvpn example is fairly trivial, but some packages are much more complicated with many more dependencies from separate sites and using separate versioning and download strategies.
## Continuous Integration
Of course, upgrades aren't always without issue, which is why it's important to integrate it with existing tests and Continuous Integration pipelines. Consider the following workflow:
* weekly, CI runs `deps-upgrade-auto` off the `master` branch, pushing new versions to `master-autoupgrade`
* CI monitors `master-autoupgrade` for new commits, and follows the typical development pipeline
* it creates a new development release version (i.e. `bosh create release`)
* it creates a new test deployment with the version and test data
* it runs unit tests and errand tests against the deployment
* based on what happens to this version-testing branch...
* *on-success*: send a Pull Request for a human to review and merge (or, assuming you have quality tests, go ahead and merge it automatically)
* *on-failure*: create an issue in the repo listing the dependency versions which changed and information about the failed step so that a human can intervene with a headstart on where they need to start investigating
This sort of pipeline results in...
* best case scenario - a bot sends me a PR with upgraded dependencies which have been tested and confirmed to work in my release and I can click "Merge"
* worst case scenario - a bot tells me I should upgrade OpenSSL but I need to investigate an issue where OpenVPN client connects are now failing a TLS handshake
## Conclusion
These `check`/`get`-type scripts and the self-upgrading approach are something I've been using in my releases lately. The value for me comes from the inherent documentation they provide, but mainly from being able to offload some of the maintenance burdens I normally need to be concerned about. Although I have yet to fully implement the steps from the [CI section](#continuous-integration) into my [Concourse][8] pipelines, I hope to get there at some point soon.
If you're interested in experimenting with the scripts from this post, you can find them in [this gist][7] along with a few other `check` scripts I've been using. You can also take a look at the commits in the OpenVPN BOSH Release where I [switched][10] to using `deps` and then subsequently [auto-upgraded][11] the dependencies.
[1]: https://bosh.io/
[2]: https://en.wikipedia.org/wiki/Yellowdog_Updater,_Modified
[3]: https://wiki.debian.org/Apt
[4]: https://openvpn.net/
[5]: https://gist.github.com/dpb587/e2d955f00378c1b78ea2#file-bin-deps-upgrade-auto-sh
[6]: http://php.net/
[7]: https://gist.github.com/dpb587/e2d955f00378c1b78ea2
[8]: http://concourse.ci/
[9]: https://github.com/dpb587/openvpn-boshrelease
[10]: https://github.com/dpb587/openvpn-boshrelease/commit/26f115dfd5d80444fee543e17edf198e7d15b485
[11]: https://github.com/dpb587/openvpn-boshrelease/commit/ac833f99cb361b0cb7fb39d70b70a0403ba87af8

View File

@@ -1,43 +0,0 @@
---
title: "Pruning Blobs from BOSH Releases"
layout: "post"
keywords: [ "blobs", "bosh", "cleanup", "packages", "pruning" ]
description: "Avoiding unnecessary disk usage for old, unneeded package files."
---
Over time, as blobs are continually added to [BOSH][1] releases, the files can start consuming lots of disk space. Blobs are frequently abandoned because newer versions replace them, or sometimes the original packages referencing them are removed. Unfortunately, freeing the disk space isn't as simple as `rm blobs/elasticsearch-1.5.2.tar.gz` because BOSH keeps track of blobs in the `config/blobs.yml` file and uses symlinks to cached copies.
To help keep a lean workspace, I remove references to blobs which are no longer needed in my release. The blobs remain untouched in the blobstore/S3, but as far as my local `bosh` command is concerned, it doesn't need to keep local copies. One option for pruning is to manually edit `config/blobs.yml` and remove the old references (and then run `bosh sync blobs` to update `blobs/`). However, I tend to go the other direction - interactively or with shell scripts - removing files from `blobs/` and then updating `blobs.yml` with this command...
for FILE in $( grep -E '^[^ ].+:$' config/blobs.yml | tr -d ':' ) ; do
[ -e "blobs/${FILE}" ] || sed -i '' -E -e "\\#^${FILE}:\$#{N;N;N;d;}" config/blobs.yml
done
Once they're gone from `blobs.yml` I can commit the changes and know that the next time I need to clone/sync into a new workspace it'll be faster.
git commit -om 'Prune old blob references' config/blobs.yml
But... while those blobs are no longer listed in `config/blobs.yml` and no longer in `blobs/`, they still exist in the `.blobs` directory where `bosh` keeps its original copies. I can remove unreferenced blobs from `.blobs` with this command...
for BLOBSHA in $( find .blobs -type f ) ; do
grep -qE "^ sha:\s+$( basename $BLOBSHA )" config/blobs.yml || rm "$BLOBSHA"
done
Even though the blobs are now effectively gone, their references still exist in repository history. For example, if you wanted to rebuild your `.blobs` cache directory you could loop through changes to `blobs.yml` and rerun `bosh sync blobs` to restore local copies...
for COMMIT in $( git rev-list --parents HEAD -- config/blobs.yml | cut -d" " -f1 ; git rev-parse HEAD ) ; do
git checkout "$COMMIT" config/blobs.yml
bosh sync blobs
done
As an example, here's a before and after of cleaning up blobs in my long-running [logsearch-boshrelease][2] workspace...
$ du -sh .blobs/ | cut -f1
904M
...cleanup...
$ du -sh .blobs/ | cut -f1
168M
[1]: http://bosh.io/
[2]: https://github.com/logsearch/logsearch-boshrelease

View File

@@ -1,158 +0,0 @@
---
title: "Tempore limites: BOSH Veneer"
layout: "post"
keywords: [ "bosh", "browser", "frontend", "user interface" ]
description: "Experimenting with a browser frontend to working with BOSH."
---
For all the low-level handling of things, BOSH is a good tool for system administration. But when it comes to
configuring everything, I think it leaves something to be desired for the average Joe. Opening my text editor, making
changes to the YAML, copying and pasting security groups from AWS Console, `git diff`ing to make sure I did what I
think, `git commit`ing in case things go bad, `bosh deploy`ing to make it so... it can become quite the process. For me,
I'm much more a visual person and prefer a browser-based tool. Since I've had a bit extra free-time lately, I've spent
some time experimenting on ideas to help improve my BOSH-quality-of-life.
## BOSH
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-login.png"><img class="iright" alt="Screenshot: core-login" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-login.png" /></a>
The `bosh` CLI can work with multiple directors and uses the `target` command to switch between instances. With a
browser-based tool, I just need to browse to the director or whatever dedicated instance I've deployed the release to.
From there, I login with my credentials as I would with `bosh login`.
While working with the project, I've been referring to it as "veneer", as in "a thin decorative covering of fine wood
applied to a coarser wood or other material."
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-release-version.png"><img class="ileft" alt="Screenshot: bosh-release-version" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-release-version.png" /></a>
One of the core features is to simply provide browser-based pages to view BOSH resources. For example, it's easy to see
the list of releases and details about specific release versions. This makes the release and configuration process much
more discoverable to end users. The screenshot shows details about the logsearch release, something which I deploy
alongside all deployments to collect logs and metrics.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-vm.png"><img class="iright" alt="Screenshot: bosh-deployment-vm" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-vm.png" /></a>
Of course, the most common BOSH resource is deployments. I can quickly pull up a specific VM to see what's installed and
how it is configured in the cloud. Since I'm using the AWS CPI, an extra link is shown on the side which links directly
to the instance in AWS Console. Further down on that page is a section which describes the persistent disk on the VM.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-job-disk.png"><img class="ileft" alt="Screenshot: bosh-deployment-job-disk" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-job-disk.png" /></a>
The AWS component of veneer knows the various CloudWatch metrics which are available for instances and disks. Here the
persistent disk metrics are shown, including timing, queue length, and idle time below. This allows me to quickly pull
up graphs if I'm trying to investigate an issue. If I do need to diagnose further in AWS Console, the sidebar link will
take me straight to the EBS Volume there.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-job.png"><img class="iright" alt="Screenshot: bosh-deployment-job" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/bosh-deployment-job.png" /></a>
I mentioned I included logsearch alongside all my deployments. Similar to veneer's AWS component, I also have a
logsearch component which advertises many internal metrics for the BOSH resources. Here, on a specific job, I can
quickly see load and memory usage over the past few hours. I can hover over the chart for specific values, or click
into the graph to change the time span, granularity, and statistical method used.
## Marketplace
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-home.png"><img class="iright" alt="Screenshot: marketplace-home" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-home.png" /></a>
One of the reasons I like BOSH is because I can use releases from both the open-source community and my own
internally built releases. The marketplace component provides a central view into the various sources where I can
pull my releases and stemcells from. For example, the `theloopyewe` marketplace enumerates a private S3 bucket using a regex
to identify artifacts and their release name/version. Of course, the `boshio` one scrapes and uses the API to pull down
the public [bosh.io][4] resources.
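Conceptually, that S3-backed marketplace isn't doing anything magical - it's close to listing the bucket and pattern-matching the keys, along these lines (a sketch with a made-up bucket name and naming convention)...

{% highlight console %}
# list release tarballs and split keys like releases/openvpn-9.tgz into name/version
$ aws s3 ls --recursive s3://example-release-artifacts/ \
    | awk '{print $4}' \
    | grep -E '\.tgz$' \
    | sed -E 's#^(.*/)?([^/]+)-([0-9][^/]*)\.tgz$#\2 \3#'
{% endhighlight %}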
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcells.png"><img class="ileft" alt="Screenshot: marketplace-stemcells" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcells.png" /></a>
From bosh.io, I can easily view the list of stemcells which are available. There are many more stemcells than I actually
use from a single director, so the checkmark helps me identify which one(s) I have already uploaded to the director. If
I want to see the full list of versions, I click on the name for a similar view. Versions which follow [semver][5] are
parsed to provide intelligent advice about whether deployments are up to date in their release and stemcell usage.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcell-upload.png"><img class="iright" alt="Screenshot: marketplace-stemcell-upload" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcell-upload.png" /></a>
When viewing a specific stemcell version, I get a quick summary and, if it's not already installed, I have the option to
upload it to the director right on-screen. Assuming the director has internet access, I can click "Upload" where the
task will be started and I get redirected to the task detail page. The release version page is similar.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcell-task.png"><img class="ileft" alt="Screenshot: marketplace-stemcell-task" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/marketplace-stemcell-task.png" /></a>
The task page automatically updates until it has completed successfully, at which point it'll redirect me to the main
stemcell summary page indicating it was completed. If an error occurs, it'll show me the full error and wait for me to
diagnose and figure out a resolution myself.
## Operations
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor.png"><img class="iright" alt="Screenshot: ops-deployment-editor" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor.png" /></a>
I've mentioned the BOSH, AWS, Logsearch, and Marketplace components, but the most intriguing one is Operations.
This component handles more of the management tasks, most notably, editing deployment configuration. It provides the
core forms for deployment manifests, but it also imports the forms that the CPI-specific component provides.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-resourcepool.png"><img class="ileft" alt="Screenshot: ops-deployment-editor-resourcepool" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-resourcepool.png" /></a>
For example, the Cloud Properties section of the resource pool uses the AWS-specific form including Instance Type, but
also properties like Availability Zone and ELB Names below the fold. You can also see the stemcell field is
intelligently populated based on the stemcell names and versions which are installed on the director.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-job.png"><img class="iright" alt="Screenshot: ops-deployment-editor-job" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-job.png" /></a>
Editing a job is also straightforward - it references the resource and disk pools already configured in the manifest so
they're easy to select. The templates are also enumerated based on which releases are already configured in the manifest
and installed on the director. The forms also clearly indicate which properties are required vs which are optional
(since there are often more properties available than are typically needed).
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-properties.png"><img class="ileft" alt="Screenshot: ops-deployment-editor-properties" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-properties.png" /></a>
Properties are another piece of deployments which are frequently changed. Here, properties are enumerated based on which
releases and templates are referenced in the deployment manifest. A green plus on the right indicates the property is
not currently set, while a blue pencil button indicates it already has a value.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-property.png"><img class="iright" alt="Screenshot: ops-deployment-editor-property" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-property.png" /></a>
If I do want to change the property, a simple form comes up where I can input my YAML-friendly value. If the release's
job spec provides the metadata, the help text includes its description and examples.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-pending.png"><img class="ileft" alt="Screenshot: ops-deployment-editor-pending" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/ops-deployment-editor-pending.png" /></a>
When changes have been saved, they are not immediately sent to the director. This allows multiple changes to be made and
then deployed at a coordinated time. It's important not to forget pending changes though, so a banner provides a
reminder and a link to compare the changes before applying them.
## Core
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-repo-deployment.png"><img class="iright" alt="Screenshot: core-repo-deployment" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-repo-deployment.png" /></a>
I mentioned changes are not immediately applied, and this is because they are actually written to a new branch in the git
repository where everything is maintained. The git repository provides the support for versioning and merging - when
clicking the Review button, it's actually just showing an intelligent `diff` between `master` and the drafted branch.
<a href="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-repo-deployment-arbitrary.png"><img class="ileft" alt="Screenshot: core-repo-deployment-arbitrary" src="{{ site.asset_prefix }}/blog/2015-11-12-tempore-limites-bosh-veneer/core-repo-deployment-arbitrary.png" /></a>
Similarly, as a git repository it can be cloned over HTTPS from veneer for backup purposes or advanced editing and then
even pushed back. This makes veneer more of a tool which can function alongside other infrastructure tools which also
commit their configurations. For example, in the earlier screenshot you'll see `cloudformation.json` templates - something
which I currently manage externally yet can still reference from my deployment manifests using pre-processing
capabilities that veneer provides.
## Summary
For enterprisey-types, I've heard there's such a thing as [Ops Manager][1] which helps provide a bit of a frontend for
deploying [certain software][2] (like [Cloud Foundry][3]). I'm not quite an enterprisey-type and don't have an
enterprisey budget, but I still appreciate having shiny tools where I can point my browser to manage, monitor, and
cross-reference my technical resources.
Since my extra free-time is coming to a close as I move on to another chapter in my life, this project will sit on my
backburner. I still like the features and ideas though, so I figured I could write a post summarizing some of them. At
the very least, if you encounter a project with similar features leave a comment - I'd love to use it myself!
[1]: http://docs.pivotal.io/pivotalcf/customizing/
[2]: https://network.pivotal.io/
[3]: https://www.cloudfoundry.org/
[4]: https://bosh.io/
[5]: http://semver.org/