---
title: The Basics of a Custom Search Engine
layout: post
tags: elasticsearch gearmand schema.org search sitemap structured-data
description: Combining elasticsearch and "structured data" to create a self-hosted search engine.
---

One of the most useful features of a website is the ability to search. [The Loopy Ewe][4] has had some form of faceted
product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog
posts and the like. [Google][1] seems to lead in offering custom search products with both [Custom Search Engine][2] and
[Site Search][3], but they're either branded or cost a bit of money. Instead of investing in their proprietary products,
I wanted to try creating a simple search engine for our needs, one that takes advantage of my previous work implementing
existing open standards.


### Introduction

In my mind, there are four basic processes when creating a search engine:

**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up
a [sitemap][6] for the site. Internally, the feature bundles of the site are responsible for generating their own
sitemap (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`][10] just
advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.
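
The mechanics are simple enough to sketch in a few lines. This is illustrative rather than the actual code, but the
shape is the same: walk the sitemap index, walk each advertised sitemap, and collect the `<loc>` values.

{% highlight php %}
<?php
// Illustrative discovery pass: walk the sitemap index, then each advertised
// sitemap, and collect every <loc> URL.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$found = array();

$index = simplexml_load_file('http://www.theloopyewe.com/sitemap.xml');

foreach ($index->children($ns)->sitemap as $advertised) {
    // Each bundle sitemap (blog posts, products, ...) lists the page URLs.
    $bundle = simplexml_load_file((string) $advertised->children($ns)->loc);

    foreach ($bundle->children($ns)->url as $url) {
        $found[] = (string) $url->children($ns)->loc;
    }
}

// Anything in $found that isn't already a `resource` record gets created
// and queued for download.
{% endhighlight %}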

**Parsing** - understanding the documents to know what content is significant. Given my previous work of [implementing
structured data][7] on the site and creating internal tools for reviewing the results, parsing becomes a very simple
task.

The next two processes are more what I want to focus on here:

* **Indexing** - ensuring the documents are accessible via search queries.
* **Maintenance** - keeping the index current as documents change or are removed.


### Indexing

We were already using [elasticsearch][8], so I was hoping to use it for full-text searching as well. I decided to
maintain two types in the search index.


#### Discovered Documents (`resource`)

The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it
directly, it's more of a basic key-based store, keyed on the URL. The mapping looks something like:

{% highlight javascript %}
{ "_id" : {
    "type" : "string" },
  "url" : {
    "type" : "string",
    "index" : "no" },
  "response_status" : {
    "type" : "string",
    "index" : "no" },
  "response_headers" : {
    "properties" : {
      "key" : {
        "type" : "string",
        "index" : "no" },
      "value" : {
        "type" : "string",
        "index" : "no" } } },
  "response_content" : {
    "type" : "string",
    "index" : "no" },
  "date_retrieved" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_expires" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}

The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it
creates a new record and queues a task to download the document. The initial record looks like:

{% highlight javascript %}
{
  "_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
  "url" : "/local/fling/spring-fling-2013/",
  "date_expires" : "2001-01-01 00:00:00"
}
{% endhighlight %}
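
The queueing piece is gearmand. The job name and workload below are illustrative rather than the real ones, but a
background job is as simple as:

{% highlight php %}
<?php
// Illustrative: queue a background download job via gearmand, using the
// URL hash (an md5 here) as the workload.
$id = md5('/local/fling/spring-fling-2013/');

$gearman = new GearmanClient();
$gearman->addServer(); // localhost:4730 by default

$gearman->doBackground('search_download', $id);
{% endhighlight %}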

Then the download task (sketched below) is responsible for:

1. Receiving a URL to download;
2. Finding the current `resource` record;
3. Validating it against `robots.txt`;
4. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
5. Updating the `resource` record with the response and new `date_*` values;
6. And, if the document has changed, queueing a task to parse the `resource`.

By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the
future. The field is used to find stale documents later on.
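
Put together, the worker body looks roughly like the sketch below. The `$resources`, `$robots`, `$http`, and `$queue`
helpers are hypothetical stand-ins for the real services, but the flow matches the list above.

{% highlight php %}
<?php
// Rough sketch of the download task; helper objects are hypothetical.
function downloadResource($url, $resources, $robots, $http, $queue)
{
    $resource = $resources->findByUrl($url);

    // Step 3: respect robots.txt before fetching anything.
    if (!$robots->isAllowed($url)) {
        return;
    }

    // Step 4: conditional request based on the last response we cached.
    $response = $http->get($url, array(
        'If-None-Match'     => $resource->getResponseHeader('ETag'),
        'If-Modified-Since' => $resource->getResponseHeader('Last-Modified'),
    ));

    // Step 5: update the cache; default the expiration several days out
    // (5 here is arbitrary) when the server doesn't send Expires.
    $resources->update($resource->getId(), array(
        'response_status'  => $response->getStatus(),
        'response_headers' => $response->getHeaders(),
        'response_content' => $response->getBody(),
        'date_retrieved'   => date('Y-m-d H:i:s'),
        'date_expires'     => date('Y-m-d H:i:s',
            strtotime($response->getHeader('Expires') ?: '+5 days')),
    ));

    // Step 6: only re-parse when the content actually changed.
    if ($response->getStatus() != 304) {
        $queue->doBackground('search_parse', $resource->getId());
    }
}
{% endhighlight %}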


#### Parsed Documents (`result`)

The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some
structured fields which are generated by the parsing step. The mapping looks like:

{% highlight javascript %}
{ "_id" : {
    "type" : "string" },
  "url" : {
    "type" : "string",
    "index" : "no" },
  "itemtype" : {
    "type" : "string",
    "analyzer" : "keyword" },
  "image" : {
    "type" : "string",
    "index" : "no" },
  "title" : {
    "boost" : 5.0,
    "type" : "string",
    "include_in_all" : true,
    "position_offset_gap" : 64,
    "index_analyzer" : "snowballed",
    "search_analyzer" : "snowballed_searcher" },
  "keywords" : {
    "boost" : 6.0,
    "type" : "string",
    "include_in_all" : true,
    "index_analyzer" : "snowballed",
    "search_analyzer" : "snowballed_searcher" },
  "description" : {
    "boost" : 3.0,
    "type" : "string",
    "analyzer" : "standard" },
  "crumbs" : {
    "boost" : 0.5,
    "properties" : {
      "url" : {
        "type" : "string",
        "index" : "no" },
      "title" : {
        "type" : "string",
        "include_in_all" : true,
        "analyzer" : "standard" } } },
  "content" : {
    "type" : "string",
    "include_in_all" : true,
    "position_offset_gap" : 128,
    "analyzer" : "standard" },
  "facts" : {
    "type" : "object",
    "enabled" : false,
    "index" : "no" },
  "date_parsed" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_published" : {
    "type" : "date",
    "format" : "yyyy-MM-dd HH:mm:ss" } }
{% endhighlight %}

A few notes on the specific fields:

* `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
* `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
* `title` - usually based on the `title` tag or more-concise `og:title` data
* `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted
  phrases)
* `description` - usually the description `meta` tag
* `content` - any remaining useful, searchable content that somebody might try to match against
* `facts` - arbitrary data used for rendering more helpful search results; some common keys:
  * `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
  * `product_model` - indicates a product model name for the result
  * `brand` - indicates the brand name for the result
  * `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
  * `availability` - for a product this is usually "in stock" or "out of stock"
* `date_published` - for content such as blog posts or announcements

The `result` type is updated by the parse task (sketched below), which is responsible for:

1. Receiving a URL to parse;
2. Finding the current `resource` record;
3. Running the `response_content` through the appropriate structured data parser;
4. Extracting generic data (e.g. title, keywords);
5. Extracting `itemtype`-specific metadata, usually for `facts`;
6. Updating the `result` record.
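
As with downloading, the worker is mostly glue around the structured data parser. A rough sketch, where `$resources`,
`$results`, and `$parser` are hypothetical stand-ins for the real services:

{% highlight php %}
<?php
// Rough sketch of the parse task; helper objects are hypothetical.
function parseResource($url, $resources, $results, $parser)
{
    $resource = $resources->findByUrl($url);

    // Step 3: structured data (in schema.org terms) from the cached response.
    $page = $parser->parse($resource->getResponseContent());

    // Steps 4-6: generic fields, itemtype-specific facts, and the record update.
    $results->update($resource->getId(), array(
        'url'         => $resource->getUrl(),
        'itemtype'    => $page->getItemType(),     // e.g. "ProductModel"
        'title'       => $page->getTitle(),
        'keywords'    => $page->getKeywords(),
        'description' => $page->getDescription(),
        'image'       => $page->getPrimaryImage(),
        'content'     => $page->getTextContent(),
        'facts'       => $page->getFacts(),        // price, brand, availability, ...
        'date_parsed' => date('Y-m-d H:i:s'),
    ));
}
{% endhighlight %}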

For example, this parsed [product model][17] looks like:

{% highlight javascript %}
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
  "itemtype" : "ProductModel",
  "title" : "Madelinetosh Tosh DK",
  "keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
  "image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
  "crumbs" : [
    {
      "url" : "/shop/",
      "title" : "Shop" },
    {
      "url" : "/shop/g/yarn/",
      "title" : "Yarn" },
    {
      "url" : "/shop/g/yarn/madelinetosh/",
      "title" : "Madelinetosh" } ],
  "content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
  "facts" : {
    "collection" : [
      {
        "value" : 93,
        "label" : "products" } ],
    "brand" : "Madelinetosh",
    "price" : "22.00" },
  "_boost" : 4 }
{% endhighlight %}


#### Searching

Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`][11] library:

{% highlight php %}
<?php
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
{% endhighlight %}

To call attention to the specific matches in the `title` and `content` fields, I can enable highlighting:

{% highlight php %}
<?php
$query->setHighlight(
    array(
        'pre_tags' => array('<strong>'),
        'post_tags' => array('</strong>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );
{% endhighlight %}
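
Executing the query and rendering results is the standard Elastica flow. A simplified example (the `search` index name
is illustrative), falling back to stored fields when a hit has no highlight:

{% highlight php %}
<?php
$client = new \Elastica\Client();
$resultSet = $client->getIndex('search')   // illustrative index name
    ->getType('result')
    ->search($query);

foreach ($resultSet->getResults() as $hit) {
    $data = $hit->getData();
    $highlights = $hit->getHighlights();

    // Prefer the highlighted fragments, fall back to the stored fields.
    $title = isset($highlights['title'][0]) ? $highlights['title'][0] : $data['title'];
    $snippet = isset($highlights['content'])
        ? implode(' ... ', $highlights['content'])
        : '';

    // Render $title, $snippet, $data['url'], $data['image'], etc.
}
{% endhighlight %}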


### Maintenance

A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I
take two approaches:

**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a
process can go through and identify documents which have not been updated recently. If it sees something is stale, it
goes ahead and queues it for update.
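
That check is just a small query against the `resource` type. A sketch of how it might look with Elastica (a filtered
match-all on `date_expires`), assuming `$index` and `$gearman` come from the surrounding application:

{% highlight php %}
<?php
// Find resources whose date_expires has passed and queue them for download.
$stale = new \Elastica\Query\Filtered(
    new \Elastica\Query\MatchAll(),
    new \Elastica\Filter\Range('date_expires', array(
        'lte' => date('Y-m-d H:i:s'),
    ))
);

$resultSet = $index->getType('resource')->search(new \Elastica\Query($stale));

foreach ($resultSet->getResults() as $hit) {
    $gearman->doBackground('search_download', $hit->getId());
}
{% endhighlight %}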

**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search
results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when it sees things
like inventory changes or product changes in an order.
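
The listeners themselves are thin; they just map an application event onto the same download queue. An illustrative
example (the event name, the `getAffectedUrls()` helper, and the `$dispatcher` are made up for the sketch):

{% highlight php %}
<?php
// Illustrative listener: when inventory changes, re-queue the affected pages.
$dispatcher->addListener('inventory.changed', function ($event) use ($gearman) {
    foreach ($event->getAffectedUrls() as $url) {
        $gearman->doBackground('search_download', md5($url));
    }
});
{% endhighlight %}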

In either case, when a URL is discovered to be gone, the records for that URL are removed from both `resource` and
`result`.
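
With Elastica, that cleanup is something like a pair of deletes by the shared `_id`:

{% highlight php %}
<?php
// Remove a gone URL from both types (same _id in each).
$index->getType('resource')->deleteById($id);
$index->getType('result')->deleteById($id);
{% endhighlight %}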


#### Utilities

Sometimes there are deploys where specific pages are definitely changing, or where a whole new sitemap with new URLs is
being registered. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available
for scripting:

* `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
* `search:index-update` - find all the expired resources and queue them for update
* `search:result-rerun` - force the download and parsing of a URL
* `search:sitemap-generate` - regenerate all registered sitemaps


### Conclusion

Starting with structured data and elasticsearch makes building a search engine significantly easier. Having the data
parsed and indexed makes it faster to show smarter [search results][16]. Existing standards like [OpenSearch][12] make
it easy to extend the search from a web page into the [browser][15] and even third-party applications via [Atom][13]
and [RSS][14] feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing
and ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the
website from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.


[1]: http://www.google.com/
[2]: http://www.google.com/cse/all
[3]: http://www.google.com/enterprise/search/products_gss_pricing.html
[4]: http://www.theloopyewe.com/
[5]: http://schema.org/
[6]: http://www.sitemaps.org/
[7]: /blog/2013/05/13/structured-data-with-schema-org.html
[8]: http://www.elasticsearch.org/
[10]: http://www.theloopyewe.com/sitemap.xml
[11]: https://github.com/ruflin/Elastica/
[12]: http://www.opensearch.org/Home
[13]: https://www.theloopyewe.com/search/results.atom?q=spring+fling
[14]: https://www.theloopyewe.com/search/results.rss?q=spring+fling
[15]: https://www.theloopyewe.com/search/opensearch.xml
[16]: https://www.theloopyewe.com/search/?q=madelinetosh
[17]: https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/