---
title: The Basics of a Custom Search Engine
layout: post
tags: elasticsearch gearmand schema.org search sitemap structured-data
description: Combining elasticsearch and "structured data" to create a self-hosted search engine.
---
One of the most useful features of a website is the ability to search. [The Loopy Ewe][4] has had some form of faceted
product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog
posts and the like. [Google][1] seems to lead in offering custom search products with both [Custom Search Engine][2] and
[Site Search][3], but they're either branded or cost a bit of money. Instead of investing in their proprietary products,
I wanted to try to create a simple search engine for our needs which took advantage of my previous work in implementing
existing open standards.
### Introduction
In my mind, there are four basic processes when creating a search engine:
**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up
a [sitemap][6] for the site. Internally, the feature bundles of the site are responsible for generating their own
sitemap (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`][10] just
advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links (a
sketch follows the list below).
**Parsing** - understanding the documents to know what content is significant. Given my previous work of [implementing
structured data][7] on the site and creating internal tools for reviewing the results, parsing becomes a very simple
task.
The next two processes are more what I want to focus on here:
* **Indexing** - ensuring the documents are accessible via search queries.
* **Maintenance** - keeping the index current as documents are updated or removed.
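
Here's that discovery sketch. It assumes PHP's SimpleXML and an illustrative `discoverUrls` name rather than the
actual implementation:

{% highlight php %}
<?php
// Sketch of the discovery step: walk the sitemap index, then each
// feature bundle's sitemap, and collect the advertised URLs.
function discoverUrls($sitemapIndexUrl)
{
    $urls = array();

    // sitemap.xml is an index whose <sitemap> entries point at the
    // bundle-specific sitemaps (blog posts, products, galleries, ...).
    $index = simplexml_load_file($sitemapIndexUrl);

    foreach ($index->sitemap as $sitemap) {
        $bundle = simplexml_load_file((string) $sitemap->loc);

        foreach ($bundle->url as $url) {
            $urls[] = (string) $url->loc;
        }
    }

    return $urls;
}
{% endhighlight %}

Each URL that comes back gets compared against the `resource` records described next; only unseen URLs create new work.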
### Indexing
We were already using [elasticsearch][8], so I was hoping to use it for full-text searching as well. I decided to
maintain two types in the search index.
#### Discovered Documents (`resource`)
The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it
directly, it's more of a basic key-based storage based on the URL. The mapping looks something like:
{% highlight javascript %}
{ "_id" : {
"type" : "string" },
"url" : {
"type" : "string",
"index" : "no" },
"response_status" : {
"type" : "string",
"index" : "no" },
"response_headers" : {
"properties" : {
"key" : {
"type" : "string",
"index" : "no" },
"value" : {
"type" : "string",
"index" : "no" } } },
"response_content" : {
"type" : "string",
"index" : "no" },
"date_retrieved" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" },
"date_expires" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
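
For completeness, here's a sketch of how that mapping might get pushed with [`ruflin/Elastica`][11] (the `search`
index name is an assumption):

{% highlight php %}
<?php
// Sketch: register the resource mapping (index name assumed).
$client = new \Elastica\Client();
$type = $client->getIndex('search')->getType('resource');

$mapping = \Elastica\Type\Mapping::create(array(
    'url' => array('type' => 'string', 'index' => 'no'),
    'response_status' => array('type' => 'string', 'index' => 'no'),
    /* ...snip: the remaining fields from the mapping above... */
    'date_expires' => array(
        'type' => 'date',
        'format' => 'yyyy-MM-dd HH:mm:ss' ) ));
$mapping->setType($type);
$mapping->send();
{% endhighlight %}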
The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it
creates a new record and queues a task to download the document. The initial record looks like:
{% highlight javascript %}
{
"_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
"url" : "/local/fling/spring-fling-2013/",
"date_expires" : "2001-01-01 00:00:00"
}
{% endhighlight %}
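
Since `gearmand` handles the background tasks, a sketch of registering a discovered URL might look like this (the
`search_download` task name is made up for illustration):

{% highlight php %}
<?php
// Sketch: insert the initial resource record and queue the download.
// The 'search_download' task name is made up for illustration.
$client = new \Elastica\Client();
$type = $client->getIndex('search')->getType('resource');

$id = md5($url); // the _id is just a hash of the URL

$type->addDocument(new \Elastica\Document($id, array(
    'url' => $url,
    'date_expires' => '2001-01-01 00:00:00', // already "expired", so it gets downloaded right away
)));

$gearman = new \GearmanClient();
$gearman->addServer(); // defaults to localhost:4730
$gearman->doBackground('search_download', $url);
{% endhighlight %}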
Then the download task is responsible for:
1. Receiving a URL to download;
2. Finding the current `resource` record;
3. Validating it against `robots.txt`;
4. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
5. Updating the `resource` record with the response and new `date_*` values;
6. And, if the document has changed, queueing a task to parse the `resource`.
By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the
future. The field is used to find stale documents later on.
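
Put together, a condensed version of the download worker might look like this sketch, where `fetchUrl`,
`robotsAllowed`, and `findHeader` are hypothetical helpers standing in for a real HTTP client:

{% highlight php %}
<?php
// Sketch of the download worker; fetchUrl(), robotsAllowed(), and
// findHeader() are hypothetical helpers, not real library calls.
function downloadTask($url, \Elastica\Type $type, \GearmanClient $gearman)
{
    $resource = $type->getDocument(md5($url))->getData();

    if (!robotsAllowed($url)) { // validate against robots.txt
        return;
    }

    // Conditional request, reusing the cached validators when present.
    $response = fetchUrl($url, array(
        'If-None-Match' => findHeader($resource, 'ETag'),
        'If-Modified-Since' => findHeader($resource, 'Last-Modified') ));

    $resource['response_status'] = $response['status'];
    $resource['response_headers'] = $response['headers'];
    $resource['date_retrieved'] = date('Y-m-d H:i:s');
    // No Expires header? Default to several days out (+4 is arbitrary).
    $resource['date_expires'] = findHeader($response, 'Expires')
        ?: date('Y-m-d H:i:s', strtotime('+4 days'));

    if ($response['status'] == 200) {
        $resource['response_content'] = $response['content'];
    }

    $type->addDocument(new \Elastica\Document(md5($url), $resource));

    // Only queue a re-parse when the document actually changed
    // (a 304 Not Modified means it didn't).
    if ($response['status'] == 200) {
        $gearman->doBackground('search_parse', $url);
    }
}
{% endhighlight %}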
#### Parsed Documents (`result`)
The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some
structured fields which are generated by the parsing step. The mapping looks like:
{% highlight javascript %}
{ "_id": {
"type": "string" },
"url": {
"type": "string",
"index": "no" },
"itemtype": {
"type": "string",
"analyzer": "keyword" },
"image": {
"type": "string",
"index": "no" },
"title": {
"boost": 5.0,
"type": "string",
"include_in_all": true,
"position_offset_gap": 64,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"keywords": {
"_boost": 6.0,
"type": "string",
"include_in_all": true,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"description": {
"_boost": 3.0,
"type": "string",
"analyzer": "standard" },
"crumbs": {
"boost": 0.5,
"properties": {
"url": {
"type": "string",
"index": "no" },
"title": {
"type": "string",
"include_in_all": true,
"analyzer": "standard" } } },
"content": {
"type": "string",
"include_in_all": true,
"position_offset_gap": 128,
"analyzer": "standard" },
"facts": {
"type": "object",
"enabled": false,
"index": "no" },
"date_parsed" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" }
"date_published" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
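
The `snowballed` analyzers referenced by `title` and `keywords` are custom and need to exist in the index settings. I
won't reproduce the real definitions, but a minimal sketch via Elastica (the filter chain here is an assumption) looks
like:

{% highlight php %}
<?php
// Sketch: define the custom analyzers the mapping refers to. The
// exact tokenizer/filter chain here is an assumption.
$client = new \Elastica\Client();
$client->getIndex('search')->create(array(
    'analysis' => array(
        'analyzer' => array(
            'snowballed' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'snowball') ),
            'snowballed_searcher' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'snowball') ) ) ) ),
    true /* recreate the index if it exists */ );
{% endhighlight %}

Splitting index-time and search-time analyzers like this leaves room to diverge them later (e.g. search-time-only
synonyms) without re-indexing.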
A few notes on the specific fields:
* `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
* `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
* `title` - usually based on the `title` tag or more-concise `og:title` data
* `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted
phrases)
* `description` - usually the description `meta` tag
* `content` - any remaining useful, searchable content somebody might try to find something in
* `facts` - arbitrary data used for rendering more helpful search results; some common keys:
  * `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
  * `product_model` - indicates the product model name for the result
  * `brand` - indicates the brand name for the result
  * `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
  * `availability` - for a product this is usually "in stock" or "out of stock"
* `date_published` - for content such as blog posts or announcements
The `result` type is updated by the parse task which is responsible for:
1. Receiving a URL to parse;
2. Finding the current `resource` record;
3. Running the `response_content` through the appropriate structured data parser;
4. Extracting generic data (e.g. title, keywords);
5. Extracting `itemtype`-specific metadata, usually for `facts`;
6. And updating the `result` record.
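
A condensed sketch of that task (where `parseStructuredData` and `extractFacts` stand in for the real parsing tools):

{% highlight php %}
<?php
// Sketch of the parse task; parseStructuredData() and extractFacts()
// are placeholders for the structured data tooling.
function parseTask($url, \Elastica\Type $resources, \Elastica\Type $results)
{
    $resource = $resources->getDocument(md5($url))->getData();

    // Run the cached response through the structured data parser.
    $parsed = parseStructuredData($resource['response_content']);

    $result = array(
        'url' => $url,
        // Generic data every page has...
        'itemtype' => $parsed->getItemtype(),
        'title' => $parsed->getTitle(),
        'keywords' => $parsed->getKeywords(),
        'description' => $parsed->getDescription(),
        // ...and itemtype-specific metadata (prices, brands, ...).
        'facts' => extractFacts($parsed),
        'date_parsed' => date('Y-m-d H:i:s') );

    // Upsert the result under the same hashed _id as the resource.
    $results->addDocument(new \Elastica\Document(md5($url), $result));
}
{% endhighlight %}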
For example, this parsed [product model][17] looks like:
{% highlight javascript %}
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
"itemtype" : "ProductModel",
"title" : "Madelinetosh Tosh DK",
"keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
"image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
"crumbs": [
{
"url" : "/shop/",
"title" : "Shop" },
{
"url" : "/shop/g/yarn/",
"title" : "Yarn" },
{
"url" : "/shop/g/yarn/madelinetosh/",
"title" : "Madelinetosh" } ],
"content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
"facts" : {
"collection": [
{
"value" : 93,
"label" : "products" } ],
"brand" : "Madelinetosh",
"price" : "22.00" },
"_boost" : 4 }
{% endhighlight %}
#### Searching
Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`][11] library:
{% highlight php %}
<?php
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
{% endhighlight %}
To draw attention to the specific matches in the `title` and `content` fields, I can enable highlighting:
{% highlight php %}
<?php
$query->setHighlight(
    array(
        'pre_tags' => array('<strong>'),
        'post_tags' => array('</strong>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );
{% endhighlight %}
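
With the query and highlighting assembled, running the search is just a matter of handing it to the `result` type and
reading back the hits (index name assumed, as before):

{% highlight php %}
<?php
// Sketch: execute the query and render titles plus highlighted
// content fragments.
$client = new \Elastica\Client();
$resultSet = $client->getIndex('search')->getType('result')->search($query);

foreach ($resultSet->getResults() as $hit) {
    $data = $hit->getData();
    $highlights = $hit->getHighlights();

    printf("%s (%s)\n", $data['title'], $data['url']);

    $fragments = isset($highlights['content']) ? $highlights['content'] : array();
    foreach ($fragments as $fragment) {
        echo '  ...', $fragment, "...\n";
    }
}
{% endhighlight %}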
### Maintenance
A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I
take two approaches:
**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a
process can go through and identify documents which have not been updated recently. If it sees something is stale, it
goes ahead and queues it for update.
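
That stale-document check amounts to a range query on `date_expires` (a sketch; paginating over large result sets is
omitted):

{% highlight php %}
<?php
// Sketch: find resources whose date_expires has passed and queue
// each one for a fresh download.
$client = new \Elastica\Client();
$resources = $client->getIndex('search')->getType('resource');

$gearman = new \GearmanClient();
$gearman->addServer();

$expired = new \Elastica\Query\Range(
    'date_expires',
    array('lt' => date('Y-m-d H:i:s')) );

foreach ($resources->search(new \Elastica\Query($expired))->getResults() as $hit) {
    $data = $hit->getData();
    $gearman->doBackground('search_download', $data['url']);
}
{% endhighlight %}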
**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search
results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things
like inventory changes or product changes in an order.
In either case, when a URL is discovered to be gone, its records are removed from both `resource` and `result`.
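
The removal itself is symmetric, since both types share the hashed `_id`:

{% highlight php %}
<?php
// Sketch: purge a gone URL from both types by its hashed _id.
$client = new \Elastica\Client();
$id = md5($url); // $url is the URL found to be gone

$client->getIndex('search')->getType('resource')->deleteById($id);
$client->getIndex('search')->getType('result')->deleteById($id);
{% endhighlight %}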
#### Utilities
Sometimes a deploy definitely changes specific pages, or registers a whole new sitemap full of new URLs. Instead of
waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:
* `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
* `search:index-update` - find all the expired resources and queue them for update
* `search:result-rerun` - force the download and parsing of a URL
* `search:sitemap-generate` - regenerate all registered sitemaps
### Conclusion
Starting with structured data and elasticsearch makes building a search engine significantly easier. The structured
data and indexing make it faster to show smarter [search results][16]. Existing standards like [OpenSearch][12] make it
easy to extend the search from a web page into the [browser][15] and even third-party applications via [Atom][13] and
[RSS][14] feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and
ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website
from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.
[1]: http://www.google.com/
[2]: http://www.google.com/cse/all
[3]: http://www.google.com/enterprise/search/products_gss_pricing.html
[4]: http://www.theloopyewe.com/
[5]: http://schema.org/
[6]: http://www.sitemaps.org/
[7]: /blog/2013/05/13/structured-data-with-schema-org.html
[8]: http://www.elasticsearch.org/
[10]: http://www.theloopyewe.com/sitemap.xml
[11]: https://github.com/ruflin/Elastica/
[12]: http://www.opensearch.org/Home
[13]: https://www.theloopyewe.com/search/results.atom?q=spring+fling
[14]: https://www.theloopyewe.com/search/results.rss?q=spring+fling
[15]: https://www.theloopyewe.com/search/opensearch.xml
[16]: https://www.theloopyewe.com/search/?q=madelinetosh
[17]: https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/