---
title: The Basics of a Custom Search Engine
layout: post
tags: elasticsearch gearmand schema.org search sitemap structured-data
description: Combining elasticsearch and "structured data" to create a self-hosted search engine.
---
One of the most useful features of a website is the ability to search. [The Loopy Ewe][4] has had some form of faceted
product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog
posts and the like. [Google][1] seems to lead in offering custom search products with both [Custom Search Engine][2] and
[Site Search][3], but they're either branded or cost a bit of money. Instead of investing in their proprietary products,
I wanted to try to create a simple search engine for our needs which took advantage of my previous work in implementing
existing open standards.
### Introduction
In my mind, there are four basic processes when creating a search engine:
**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up
a [sitemap][6] for the site. Internally, the feature bundles of the site are responsible for generating their own
sitemap (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`][10] just
advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links (a
sketch follows the list below).
**Parsing** - understanding the documents to know what content is significant. Given my previous work of [implementing
structured data][7] on the site and creating internal tools for reviewing the results, parsing becomes a very simple
task.
The next two processes are more what I want to focus on here:
* **Indexing** - ensuring the documents are accessible via search queries.
* **Maintenance** - keeping the index current as documents are updated or removed.
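
Here's that discovery sketch. It assumes PHP's SimpleXML and an illustrative `discoverUrls` name rather than the
actual implementation:

{% highlight php %}
<?php
// Sketch of the discovery step: walk the sitemap index, then each
// feature bundle's sitemap, and collect the advertised URLs.
function discoverUrls($sitemapIndexUrl)
{
    $urls = array();

    // sitemap.xml is an index whose <sitemap> entries point at the
    // bundle-specific sitemaps (blog posts, products, galleries, ...).
    $index = simplexml_load_file($sitemapIndexUrl);

    foreach ($index->sitemap as $sitemap) {
        $bundle = simplexml_load_file((string) $sitemap->loc);

        foreach ($bundle->url as $url) {
            $urls[] = (string) $url->loc;
        }
    }

    return $urls;
}
{% endhighlight %}

Each URL that comes back gets compared against the `resource` records described next; only unseen URLs create new work.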
### Indexing
We were already using [elasticsearch][8], so I was hoping to use it for full-text searching as well. I decided to
maintain two types in the search index.
#### Discovered Documents (`resource`)
The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it
directly, it's more of a basic key-based storage based on the URL. The mapping looks something like:
{% highlight javascript %}
{ "_id" : {
"type" : "string" },
"url" : {
"type" : "string",
"index" : "no" },
"response_status" : {
"type" : "string",
"index" : "no" },
"response_headers" : {
"properties" : {
"key" : {
"type" : "string",
"index" : "no" },
"value" : {
"type" : "string",
"index" : "no" } } },
"response_content" : {
"type" : "string",
"index" : "no" },
"date_retrieved" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" },
"date_expires" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
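
For completeness, here's a sketch of how that mapping might get pushed with [`ruflin/Elastica`][11] (the `search`
index name is an assumption):

{% highlight php %}
<?php
// Sketch: register the resource mapping (index name assumed).
$client = new \Elastica\Client();
$type = $client->getIndex('search')->getType('resource');

$mapping = \Elastica\Type\Mapping::create(array(
    'url' => array('type' => 'string', 'index' => 'no'),
    'response_status' => array('type' => 'string', 'index' => 'no'),
    /* ...snip: the remaining fields from the mapping above... */
    'date_expires' => array(
        'type' => 'date',
        'format' => 'yyyy-MM-dd HH:mm:ss' ) ));
$mapping->setType($type);
$mapping->send();
{% endhighlight %}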
The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it
creates a new record and queues a task to download the document. The initial record looks like:
{% highlight javascript %}
{
"_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
"url" : "/local/fling/spring-fling-2013/",
"date_expires" : "2001-01-01 00:00:00"
}
{% endhighlight %}
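
Since `gearmand` handles the background tasks, a sketch of registering a discovered URL might look like this (the
`search_download` task name is made up for illustration):

{% highlight php %}
<?php
// Sketch: insert the initial resource record and queue the download.
// The 'search_download' task name is made up for illustration.
$client = new \Elastica\Client();
$type = $client->getIndex('search')->getType('resource');

$id = md5($url); // the _id is just a hash of the URL

$type->addDocument(new \Elastica\Document($id, array(
    'url' => $url,
    'date_expires' => '2001-01-01 00:00:00', // already "expired", so it gets downloaded right away
)));

$gearman = new \GearmanClient();
$gearman->addServer(); // defaults to localhost:4730
$gearman->doBackground('search_download', $url);
{% endhighlight %}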
Then the download task is responsible for:
1. Receiving a URL to download;
2. Finding the current `resource` record;
3. Validating it against `robots.txt`;
4. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
5. Updating the `resource` record with the response and new `date_*` values;
6. And, if the document has changed, queueing a task to parse the `resource`.
By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the
future. The field is used to find stale documents later on.
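
Put together, a condensed version of the download worker might look like this sketch, where `fetchUrl`,
`robotsAllowed`, and `findHeader` are hypothetical helpers standing in for a real HTTP client:

{% highlight php %}
<?php
// Sketch of the download worker; fetchUrl(), robotsAllowed(), and
// findHeader() are hypothetical helpers, not real library calls.
function downloadTask($url, \Elastica\Type $type, \GearmanClient $gearman)
{
    $resource = $type->getDocument(md5($url))->getData();

    if (!robotsAllowed($url)) { // validate against robots.txt
        return;
    }

    // Conditional request, reusing the cached validators when present.
    $response = fetchUrl($url, array(
        'If-None-Match' => findHeader($resource, 'ETag'),
        'If-Modified-Since' => findHeader($resource, 'Last-Modified') ));

    $resource['response_status'] = $response['status'];
    $resource['response_headers'] = $response['headers'];
    $resource['date_retrieved'] = date('Y-m-d H:i:s');
    // No Expires header? Default to several days out (+4 is arbitrary).
    $resource['date_expires'] = findHeader($response, 'Expires')
        ?: date('Y-m-d H:i:s', strtotime('+4 days'));

    if ($response['status'] == 200) {
        $resource['response_content'] = $response['content'];
    }

    $type->addDocument(new \Elastica\Document(md5($url), $resource));

    // Only queue a re-parse when the document actually changed
    // (a 304 Not Modified means it didn't).
    if ($response['status'] == 200) {
        $gearman->doBackground('search_parse', $url);
    }
}
{% endhighlight %}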
#### Parsed Documents (`result`)
The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some
structured fields which are generated by the parsing step. The mapping looks like:
{% highlight javascript %}
{ "_id": {
"type": "string" },
"url": {
"type": "string",
"index": "no" },
"itemtype": {
"type": "string",
"analyzer": "keyword" },
"image": {
"type": "string",
"index": "no" },
"title": {
"boost": 5.0,
"type": "string",
"include_in_all": true,
"position_offset_gap": 64,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"keywords": {
"_boost": 6.0,
"type": "string",
"include_in_all": true,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"description": {
"_boost": 3.0,
"type": "string",
"analyzer": "standard" },
"crumbs": {
"boost": 0.5,
"properties": {
"url": {
"type": "string",
"index": "no" },
"title": {
"type": "string",
"include_in_all": true,
"analyzer": "standard" } } },
"content": {
"type": "string",
"include_in_all": true,
"position_offset_gap": 128,
"analyzer": "standard" },
"facts": {
"type": "object",
"enabled": false,
"index": "no" },
"date_parsed" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" }
"date_published" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}
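
The `snowballed` analyzers referenced by `title` and `keywords` are custom and need to exist in the index settings. I
won't reproduce the real definitions, but a minimal sketch via Elastica (the filter chain here is an assumption) looks
like:

{% highlight php %}
<?php
// Sketch: define the custom analyzers the mapping refers to. The
// exact tokenizer/filter chain here is an assumption.
$client = new \Elastica\Client();
$client->getIndex('search')->create(array(
    'analysis' => array(
        'analyzer' => array(
            'snowballed' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'snowball') ),
            'snowballed_searcher' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'snowball') ) ) ) ),
    true /* recreate the index if it exists */ );
{% endhighlight %}

Splitting index-time and search-time analyzers like this leaves room to diverge them later (e.g. search-time-only
synonyms) without re-indexing.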
A few notes on the specific fields:
* `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
* `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
* `title` - usually based on the `title` tag or more-concise `og:title` data
* `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted
phrases)
* `description` - usually the description `meta` tag
* `content` - any remaining useful, searchable content somebody might try to find something in
* `facts` - arbitrary data used for rendering more helpful search results; some common keys:
  * `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
  * `product_model` - indicates the product model name for the result
  * `brand` - indicates the brand name for the result
  * `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
  * `availability` - for a product this is usually "in stock" or "out of stock"
* `date_published` - for content such as blog posts or announcements
The `result` type is updated by the parse task which is responsible for:
1. Receiving a URL to parse;
2. Finding the current `resource` record;
3. Running the `response_content` through the appropriate structured data parser;
4. Extracting generic data (e.g. title, keywords);
5. Extracting `itemtype`-specific metadata, usually for `facts`;
6. And updating the `result` record.
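
A condensed sketch of that task (where `parseStructuredData` and `extractFacts` stand in for the real parsing tools):

{% highlight php %}
<?php
// Sketch of the parse task; parseStructuredData() and extractFacts()
// are placeholders for the structured data tooling.
function parseTask($url, \Elastica\Type $resources, \Elastica\Type $results)
{
    $resource = $resources->getDocument(md5($url))->getData();

    // Run the cached response through the structured data parser.
    $parsed = parseStructuredData($resource['response_content']);

    $result = array(
        'url' => $url,
        // Generic data every page has...
        'itemtype' => $parsed->getItemtype(),
        'title' => $parsed->getTitle(),
        'keywords' => $parsed->getKeywords(),
        'description' => $parsed->getDescription(),
        // ...and itemtype-specific metadata (prices, brands, ...).
        'facts' => extractFacts($parsed),
        'date_parsed' => date('Y-m-d H:i:s') );

    // Upsert the result under the same hashed _id as the resource.
    $results->addDocument(new \Elastica\Document(md5($url), $result));
}
{% endhighlight %}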
For example, this parsed [product model][17] looks like:
{% highlight javascript %}
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
"itemtype" : "ProductModel",
"title" : "Madelinetosh Tosh DK",
"keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
"image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
"crumbs": [
{
"url" : "/shop/",
"title" : "Shop" },
{
"url" : "/shop/g/yarn/",
"title" : "Yarn" },
{
"url" : "/shop/g/yarn/madelinetosh/",
"title" : "Madelinetosh" } ],
"content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
"facts" : {
"collection": [
{
"value" : 93,
"label" : "products" } ],
"brand" : "Madelinetosh",
"price" : "22.00" },
"_boost" : 4 }
{% endhighlight %}
#### Searching
Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`][11] library:
{% highlight php %}
<?php
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
{% endhighlight %}
To draw attention to the specific matches in the `title` and `content` fields, I can enable highlighting:
{% highlight php %}
<?php
$query->setHighlight(
    array(
        'pre_tags' => array('<strong>'),
        'post_tags' => array('</strong>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );
{% endhighlight %}
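
With the query and highlighting assembled, running the search is just a matter of handing it to the `result` type and
reading back the hits (index name assumed, as before):

{% highlight php %}
<?php
// Sketch: execute the query and render titles plus highlighted
// content fragments.
$client = new \Elastica\Client();
$resultSet = $client->getIndex('search')->getType('result')->search($query);

foreach ($resultSet->getResults() as $hit) {
    $data = $hit->getData();
    $highlights = $hit->getHighlights();

    printf("%s (%s)\n", $data['title'], $data['url']);

    $fragments = isset($highlights['content']) ? $highlights['content'] : array();
    foreach ($fragments as $fragment) {
        echo '  ...', $fragment, "...\n";
    }
}
{% endhighlight %}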
### Maintenance
A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I
take two approaches:
**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a
process can go through and identify documents which have not been updated recently. If it sees something is stale, it
goes ahead and queues it for update.
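
That stale-document check amounts to a range query on `date_expires` (a sketch; paginating over large result sets is
omitted):

{% highlight php %}
<?php
// Sketch: find resources whose date_expires has passed and queue
// each one for a fresh download.
$client = new \Elastica\Client();
$resources = $client->getIndex('search')->getType('resource');

$gearman = new \GearmanClient();
$gearman->addServer();

$expired = new \Elastica\Query\Range(
    'date_expires',
    array('lt' => date('Y-m-d H:i:s')) );

foreach ($resources->search(new \Elastica\Query($expired))->getResults() as $hit) {
    $data = $hit->getData();
    $gearman->doBackground('search_download', $data['url']);
}
{% endhighlight %}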
**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search
results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things
like inventory changes or product changes in an order.
In either case, when a URL is discovered to be gone, its records are removed from both `resource` and `result`.
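
The removal itself is symmetric, since both types share the hashed `_id`:

{% highlight php %}
<?php
// Sketch: purge a gone URL from both types by its hashed _id.
$client = new \Elastica\Client();
$id = md5($url); // $url is the URL found to be gone

$client->getIndex('search')->getType('resource')->deleteById($id);
$client->getIndex('search')->getType('result')->deleteById($id);
{% endhighlight %}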
#### Utilities
Sometimes a deploy definitely changes specific pages, or registers a whole new sitemap full of new URLs. Instead of
waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:
* `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
* `search:index-update` - find all the expired resources and queue them for update
* `search:result-rerun` - force the download and parsing of a URL
* `search:sitemap-generate` - regenerate all registered sitemaps
### Conclusion
Starting with structured data and elasticsearch makes building a search engine significantly easier. The structured
data and indexing make it faster to show smarter [search results][16]. Existing standards like [OpenSearch][12] make it
easy to extend the search from a web page into the [browser][15] and even third-party applications via [Atom][13] and
[RSS][14] feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and
ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website
from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.
[1]: http://www.google.com/
[2]: http://www.google.com/cse/all
[3]: http://www.google.com/enterprise/search/products_gss_pricing.html
[4]: http://www.theloopyewe.com/
[5]: http://schema.org/
[6]: http://www.sitemaps.org/
[7]: /blog/2013/05/13/structured-data-with-schema-org.html
[8]: http://www.elasticsearch.org/
[10]: http://www.theloopyewe.com/sitemap.xml
[11]: https://github.com/ruflin/Elastica/
[12]: http://www.opensearch.org/Home
[13]: https://www.theloopyewe.com/search/results.atom?q=spring+fling
[14]: https://www.theloopyewe.com/search/results.rss?q=spring+fling
[15]: https://www.theloopyewe.com/search/opensearch.xml
[16]: https://www.theloopyewe.com/search/?q=madelinetosh
[17]: https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/