From 488726a842c58b24bc6788f2e25fd8fad5d621f5 Mon Sep 17 00:00:00 2001
From: Danny Berger
Date: Sat, 1 Jun 2013 18:49:56 -0600
Subject: [PATCH] add search-engine-based-on-structured-data post

---
 ...-search-engine-based-on-structured-data.md | 315 ++++++++++++++++++
 1 file changed, 315 insertions(+)
 create mode 100644 blog/_posts/2013-06-01-search-engine-based-on-structured-data.md

diff --git a/blog/_posts/2013-06-01-search-engine-based-on-structured-data.md b/blog/_posts/2013-06-01-search-engine-based-on-structured-data.md
new file mode 100644
index 0000000..e78b770
--- /dev/null
+++ b/blog/_posts/2013-06-01-search-engine-based-on-structured-data.md
@@ -0,0 +1,315 @@
---
title: The Basics of a Custom Search Engine
layout: post
tags: elasticsearch gearmand schema.org search sitemap structured-data
description: Combining elasticsearch and "structured data" to create a self-hosted search engine.
---

One of the most useful features of a website is the ability to search. [The Loopy Ewe][4] has had some form of faceted
product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog
posts, and the like. [Google][1] seems to lead in offering custom search products with both [Custom Search Engine][2] and
[Site Search][3], but they're either branded or cost a bit of money. Instead of investing in their proprietary products,
I wanted to try creating a simple search engine for our needs, one which took advantage of my previous work implementing
existing open standards.


### Introduction

In my mind, there are four basic processes when creating a search engine:

**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up
a [sitemap][6] for the site. Internally, the feature bundles of the site are responsible for generating their own
sitemaps (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`][10] just
advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.
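
To make that discovery pass concrete, here is a minimal sketch of walking the sitemap index, assuming the advertised
sitemaps follow the standard [sitemaps.org][6] protocol. The `queueResourceUrl()` helper is hypothetical, standing in
for the actual record-creation and queueing logic:

{% highlight php %}
<?php
// A minimal discovery sketch: walk the sitemap index and hand off every
// advertised URL. queueResourceUrl() is hypothetical - it stands in for
// creating the `resource` record and queueing the download task.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$index = simplexml_load_file('http://www.theloopyewe.com/sitemap.xml');

foreach ($index->children($ns)->sitemap as $sitemap) {
    $urlset = simplexml_load_file((string) $sitemap->children($ns)->loc);

    foreach ($urlset->children($ns)->url as $url) {
        queueResourceUrl((string) $url->children($ns)->loc);
    }
}
{% endhighlight %}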

**Parsing** - understanding the documents to know what content is significant. Given my previous work [implementing
structured data][7] on the site and creating internal tools for reviewing the results, parsing becomes a very simple
task.

The next two processes are more what I want to focus on here:

 * **Indexing** - ensuring the documents are accessible via search queries.
 * **Maintenance** - keeping the index current as documents change or are removed.


### Indexing

We were already using [elasticsearch][8], so I was hoping to use it for full-text searching as well. I decided to
maintain two types in the search index.


#### Discovered Documents (`resource`)

The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it
directly, it's more of a basic key-value store keyed by the URL. The mapping looks something like:

{% highlight javascript %}
{ "_id" : {
    "type" : "string" },
  "url" : {
    "type" : "string",
    "index" : "no" },
  "response_status" : {
    "type" : "string",
    "index" : "no" },
  "response_headers" : {
    "properties" : {
      "key" : {
        "type" : "string",
        "index" : "no" },
      "value" : {
        "type" : "string",
        "index" : "no" } } },
  "response_content" : {
    "type" : "string",
    "index" : "no" },
  "date_retrieved" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_expires" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}

The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it
creates a new record and queues a task to download the document. The initial record looks like:

{% highlight javascript %}
{
  "_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
  "url" : "/local/fling/spring-fling-2013/",
  "date_expires" : "2001-01-01 00:00:00"
}
{% endhighlight %}

Then the download task is responsible for:

 1. Receiving a URL to download;
 2. Finding the current `resource` record;
 3. Validating it against `robots.txt`;
 4. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
 5. Updating the `resource` record with the response and new `date_*` values;
 6. And, if the document has changed, queueing a task to parse the `resource`.

By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the
future. The field is used to find stale documents later on.
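
Step 4 of that list deserves a quick illustration. Here is a hedged sketch of the conditional request, assuming
`$resource` holds the stored record decoded into an array shaped like the mapping above:

{% highlight php %}
<?php
// Sketch of the download task's conditional request (step 4). $resource is
// assumed to be the stored `resource` record, decoded into an array.
$headers = array();

foreach ((array) $resource['response_headers'] as $header) {
    if ('ETag' == $header['key']) {
        $headers[] = 'If-None-Match: ' . $header['value'];
    } elseif ('Last-Modified' == $header['key']) {
        $headers[] = 'If-Modified-Since: ' . $header['value'];
    }
}

$ch = curl_init('http://www.theloopyewe.com' . $resource['url']);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if (304 == $status) {
    // Not modified - only the date_* fields on the record need refreshing.
} else {
    // Changed - update the cached response and queue the parse task (step 6).
}
{% endhighlight %}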

#### Parsed Documents (`result`)

The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some
structured fields which are generated by the parsing step. The mapping looks like:

{% highlight javascript %}
{ "_id": {
    "type": "string" },
  "url": {
    "type": "string",
    "index": "no" },
  "itemtype": {
    "type": "string",
    "analyzer": "keyword" },
  "image": {
    "type": "string",
    "index": "no" },
  "title": {
    "boost": 5.0,
    "type": "string",
    "include_in_all": true,
    "position_offset_gap": 64,
    "index_analyzer": "snowballed",
    "search_analyzer": "snowballed_searcher" },
  "keywords": {
    "boost": 6.0,
    "type": "string",
    "include_in_all": true,
    "index_analyzer": "snowballed",
    "search_analyzer": "snowballed_searcher" },
  "description": {
    "boost": 3.0,
    "type": "string",
    "analyzer": "standard" },
  "crumbs": {
    "boost": 0.5,
    "properties": {
      "url": {
        "type": "string",
        "index": "no" },
      "title": {
        "type": "string",
        "include_in_all": true,
        "analyzer": "standard" } } },
  "content": {
    "type": "string",
    "include_in_all": true,
    "position_offset_gap": 128,
    "analyzer": "standard" },
  "facts": {
    "type": "object",
    "enabled": false,
    "index": "no" },
  "date_parsed" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_published" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}

A few notes on the specific fields:

 * `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
 * `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
 * `title` - usually based on the `title` tag or the more-concise `og:title` data
 * `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted
   phrases)
 * `description` - usually the description `meta` tag
 * `content` - any remaining useful, searchable content from the page that somebody might try to find
 * `facts` - arbitrary data used for rendering more helpful search results; some common keys:
    * `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
    * `product_model` - indicates a product model name for the result
    * `brand` - indicates the brand name for the result
    * `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
    * `availability` - for a product, this is usually "in stock" or "out of stock"
 * `date_published` - for content such as blog posts or announcements

The `result` type is updated by the parse task which is responsible for:

 1. Receiving a URL to parse;
 2. Finding the current `resource` record;
 3. Running the `response_content` through the appropriate structured data parser;
 4. Extracting generic data (e.g. title, keywords);
 5. Extracting `itemtype`-specific metadata, usually for `facts`;
 6. And updating the `result` record.

For example, this parsed [product model][17] looks like:

{% highlight javascript %}
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
  "itemtype" : "ProductModel",
  "title" : "Madelinetosh Tosh DK",
  "keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
  "image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
  "crumbs": [
    { "url" : "/shop/",
      "title" : "Shop" },
    { "url" : "/shop/g/yarn/",
      "title" : "Yarn" },
    { "url" : "/shop/g/yarn/madelinetosh/",
      "title" : "Madelinetosh" } ],
  "content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
  "facts" : {
    "collection": [
      { "value" : 93,
        "label" : "products" } ],
    "brand" : "Madelinetosh",
    "price" : "22.00" },
  "_boost" : 4 }
{% endhighlight %}


#### Searching

Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`][11] library:

{% highlight php %}
<?php
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
{% endhighlight %}

To visually call out the specific matches in the `title` and `content` fields, I can enable highlighting:

{% highlight php %}
<?php
$query->setHighlight(
    array(
        'pre_tags' => array('<em>'),
        'post_tags' => array('</em>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );
{% endhighlight %}
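
Putting those pieces together, here is a short, illustrative sketch of executing the query and reading back hits with
their highlights. The index name (`search`) is an assumption of this sketch, and the client configuration is omitted:

{% highlight php %}
<?php
// Illustrative usage only: run the assembled $query against the `result`
// type. The index name 'search' is an assumption of this sketch.
$client = new \Elastica\Client();

$resultSet = $client->getIndex('search')->getType('result')->search($query);

foreach ($resultSet->getResults() as $hit) {
    $data = $hit->getData();
    $highlights = $hit->getHighlights();

    // Prefer the highlighted title fragment when one is available.
    $title = isset($highlights['title'][0]) ? $highlights['title'][0] : $data['title'];

    printf("%s => %s\n", $data['url'], $title);
}
{% endhighlight %}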

### Maintenance

A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I
take two approaches:

**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a
process can go through and identify documents which have not been updated recently. If it sees something is stale, it
goes ahead and queues it for update.

**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search
results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when it sees things
like inventory changes or product changes in an order.

In either case, when a URL is discovered to be gone, its records are removed from both `resource` and `result`.


#### Utilities

Sometimes there are deploys where specific pages are definitely changing, or where a whole new sitemap with new URLs is
being registered. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available
for scripting:

 * `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
 * `search:index-update` - find all the expired resources and queue them for update
 * `search:result-rerun` - force the download and parsing of a URL
 * `search:sitemap-generate` - regenerate all registered sitemaps


### Conclusion

Starting with structured data and elasticsearch makes building a search engine significantly easier. Data and indexing
make it faster to show smarter [search results][16]. Existing standards like [OpenSearch][12] make it easy to extend
the search from a web page into the [browser][15] and even third-party applications via [Atom][13] and [RSS][14] feeds.
Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking
algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the
perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.


 [1]: http://www.google.com/
 [2]: http://www.google.com/cse/all
 [3]: http://www.google.com/enterprise/search/products_gss_pricing.html
 [4]: http://www.theloopyewe.com/
 [5]: http://schema.org/
 [6]: http://www.sitemaps.org/
 [7]: /blog/2013/05/13/structured-data-with-schema-org.html
 [8]: http://www.elasticsearch.org/
[10]: http://www.theloopyewe.com/sitemap.xml
[11]: https://github.com/ruflin/Elastica/
[12]: http://www.opensearch.org/Home
[13]: https://www.theloopyewe.com/search/results.atom?q=spring+fling
[14]: https://www.theloopyewe.com/search/results.rss?q=spring+fling
[15]: https://www.theloopyewe.com/search/opensearch.xml
[16]: https://www.theloopyewe.com/search/?q=madelinetosh
[17]: https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/