Unfortunately, I no longer have the time to maintain this book which is growing increasingly out of date (esp. with the upcoming Elasticsearch 2.0). I highly recommend checking out Elasticsearch, the Definitive Guide instead. This site will remain up indefinitely to prevent link rot.

Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    Chapter 7 Product Search with Drilldown

    One of the most common use cases for elasticsearch is the ubiquitous query string with faceted drilldown, most frequently seen on e-commerce websites. Amazon is perhaps the best example of this. Let’s consider a search for the term “Network Routing”. After executing such a search one is presented with a familiar interface: top product results on the right two thirds of the screen, and a left sidebar with ‘departments’ such as “Books” and “Electronics”, similar to what’s seen in the figure below.

    [Figure: facet UI mockup, showing product results with a department drilldown sidebar]

    Mapping this sort of UI to an elasticsearch query is a matter of generating two sets of results for a given query: hits and facets. Conveniently, elasticsearch lets you easily combine a regular query with faceting. The essential strategy we’ll employ is to combine a free text query for the main body of results with terms facets to populate the left hand sidebar. When a user has clicked on a facet, restricting the search to that department alone, we will supply an additional filter against that department’s ID, properly restricting the search result. Modeling this type of search is straightforward, if not always intuitive.

    7.1 Drilling Down One Level

    In this section, we’ll learn how to drill down a single level. Multiple-level drilldown will be covered later in this chapter.

    We’ll start by examining the schema for our fictional e-commerce site, seen in figure 7.1. Notice the fields for the department in our model. We’ll be using the department field to create those left-hand facets. The department for a document is stored in three separate ways: once as an integer, which presumably points to a primary key in our primary datastore (perhaps an RDBMS); once as a not_analyzed token, so that it appears unmodified in facet aggregation, as we’ll see later; and finally as a snowball analyzed field for natural language text matching. These choices have all been made to enable maximum flexibility while querying, which will be explained as we proceed through this chapter. The other fields are more straightforward, such as the snowball analyzed “name” and “description” fields, and an integer “price” field.

    F 7.1 The Product Mapping
    // Load Dataset: products.eloader
    GET products/_mapping?pretty=true
    // The response should be as follows
    {
      "products": {
        "product": {
          "properties": {
            "department_id": {"type": "integer"},
            "department_name": {"type": "string", "index": "not_analyzed"},
            "department_name_analyzed": {"type": "string", "analyzer": "snowball"},
            "description": {"type": "string", "analyzer": "snowball"},
            "name": {"type": "string", "analyzer": "snowball"},
            "price": {"type": "integer"} }}}}
    
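    For reference, a mapping along these lines could also be created by hand with a request like the one below. This is only a sketch; the products.eloader dataset already sets the mapping up for you, so there is no need to run it.

    // A hand-built equivalent of the mapping shown in figure 7.1 (for reference only)
    PUT products
    {
      "mappings": {
        "product": {
          "properties": {
            "department_id": {"type": "integer"},
            "department_name": {"type": "string", "index": "not_analyzed"},
            "department_name_analyzed": {"type": "string", "analyzer": "snowball"},
            "description": {"type": "string", "analyzer": "snowball"},
            "name": {"type": "string", "analyzer": "snowball"},
            "price": {"type": "integer"} }}}}
    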

    We can get an idea of what our results will look like by issuing a simple faceted query, as in figure 7.2. This will return both a list of hits and a list of matching departments, with the count of documents matching a given department included. The returned department_id and department_name facet results should indicate that three products in books and two in electronics match.

    F 7.2 Simple Faceted Product Search
    // Load Dataset: products.eloader
    POST products/_search?pretty=true
    {
      "query": {
      "match": {"_all": "network routing"}},
      "facets":{
        "department_id": {"terms": {"field": "department_id"}},
        "department_name": {"terms": {"field": "department_name"}}}
    }
    
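    The facet portion of the response comes back as per-term counts. Trimmed for brevity, it should look roughly like the following; the terms and numbers shown here simply restate the book and electronics counts mentioned above, and the exact term strings depend on how the department names are stored.

    // Abridged facet section of the response to figure 7.2 (values illustrative)
    {
      "facets": {
        "department_name": {
          "_type": "terms",
          "missing": 0,
          "total": 5,
          "other": 0,
          "terms": [
            {"term": "Books", "count": 3},
            {"term": "Electronics", "count": 2}]}}}
    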

    Generating these queries in a real app is of course the application’s responsibility. It is for this reason that our schema includes the department_id field. In most real apps one would want to map facet names back to actual RDBMS or other authoritative records. While an un-tokenized not_analyzed name can also be used for lookups, using direct IDs is cleaner, and can be more reliable.

    The query we’ll use to drill down is a simple modification of the one in figure 7.2. We must simply add a query filter for the relevant department_id, as illustrated in figure 7.3. This filter must be applied at the query level, via either a constant_score or filtered query. Filters placed at other levels have unwanted effects, as described in the Filtering chapter.

    F 7.3 Simple Faceted Product Search, Filtered
    // Load Dataset: products.eloader
    POST products/_search?pretty=true
    {
      "query": {
        "filtered": {
          "query": {
             "match": {"_all": "network routing"}},
           "filter": {
             "term": {"department_id": 1}}
         }}}
    

    Drilling down to a single category is a simple matter of applying a filter. We used a filter rather than a query because filters provide a performance boost: they only restrict the result set and do not affect result scores, which also allows them to be cached. In the next section we’ll learn how to construct a more complex query that drills down multiple levels.

    7.2 Drilling Down Multiple Levels

    We’ve seen how to drill down a single level using our department facets; now let’s examine having multiple facets that can be independently or jointly selected. We’ll make it possible to drill down by both department and price. First, let’s adjust our facets to include price in their counts. To do this we’ll use a different kind of facet, a histogram facet. Since there are many distinct values for price, it makes sense to group similar prices into range-based groups. The query in figure 7.4 facets prices in $50 increments. Please note that our prices are stored as integer cents, hence $50 is stored as 5000.

    F 7.4 Histogram Faceting Prices
    // Load Dataset: products.eloader
    POST products/_search?pretty=true
    {
      "query": {
      "match": {"_all": "network routing"}},
      "facets":{
        "department_id": {"terms": {"field": "department_id"}},
        "department_name": {"terms": {"field": "department_name"}},
        "price": {"histogram": {"field": "price", "interval": 5000}}} }
    
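    In the response to figure 7.4, the price facet comes back as a list of bucket entries, each keyed by the lower bound of its range. The shape below is the histogram facet format, abridged; the counts are purely illustrative.

    // Abridged price facet from the response to figure 7.4 (counts illustrative)
    "price": {
      "_type": "histogram",
      "entries": [
        {"key": 0, "count": 3},
        {"key": 5000, "count": 2}]}
    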

    Our histogram facet returns a map of price ranges to the number of products falling within that price range. This could be presented on the sidebar as a series of checkboxes, allowing users to select price ranges they are most comfortable with. Our query for a single checkbox would be the same as in our single-level example, just with our additional facets. Since the facets are computed after the query is applied, the counts update to reflect the smaller result set once we’ve drilled down a level. Examine figure 7.5, which shows the facet counts after having drilled down a single level.

    F 7.5 Histogram Faceting Prices, Drilled Down One Level
    // Load Dataset: products.eloader
    POST products/_search?pretty=true
    {
      "query": {
        "filtered": {
          "query": {
             "match": {"_all": "network routing"}},
           "filter": {
             "and": [
               {"range": {"price": {"gte": 0, "lte": 5000}}}
             ]} }},
      "facets":{
        "department_id": {"terms": {"field": "department_id"}},
        "department_name": {"terms": {"field": "department_name"}},
        "price": {"histogram": {"field": "price", "interval": 5000}}} }
    

    Drilling down a second level works similarly. If a user were to select the $0-$50 range, and also select the books department only, we would issue the query in figure 7.6. The query should match exactly one product, Network Routing in Example Land, the only book about network routing that is less than $50, costing $49.90.

    F 7.6 Faceted Product Search, Filtered on Two Fields
    // Load Dataset: products.eloader
    POST products/_search?pretty=true
    {
      "query": {
        "filtered": {
          "query": {
             "match": {"_all": "network routing"}},
           "filter": {
             "and": [
               {"term": {"department_id": 1}},
               {"range": {"price": {"gte": 0, "lte": 5000}}}
             ]
           } }
      },
      "facets":{
        "department_id": {"terms": {"field": "department_id"}},
        "department_name": {"terms": {"field": "department_name"}},
        "price": {"histogram": {"field": "price", "interval": 5000}}} }
    

    This chapter has been a tutorial in the basics of standard e-commerce searches with a facet drilldown. There are, however, many variations on this. One case we haven’t covered is searching tags, a topic covered in the next section.

    7.3 Searching and Faceting Tags

    7.3.1 Tag Drilldown and Faceting

    Tags are a common and powerful tool for modeling data. Their flexibility makes them pervasive. While basic use cases for tags have simple implementations, providing richer query options can be quite tricky.

    Tags generally have one of two search use cases: they are either used to filter, which is a straightforward task, or they are analyzed and matched against searches using the query, which is more complicated. We’ll cover both types of searches in this tutorial.

    To understand our modeling for tags in elasticsearch we’ll quickly recap multi-valued fields. Any field in an elasticsearch document can store either one value or many, depending on whether the data supplied is a single value or an array. Lucene does not actually have a notion of arrays (elasticsearch adds this on top), but it does allow any field to hold multiple values. This is an ideal fit for tags, where a single document may carry several. We can have one “tags” field in our document, typed as a string, and then provide multiple values for it.
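
    As a minimal sketch of this idea, a document with several tags can be indexed by simply supplying an array for the field. The index, type, and field names below are hypothetical and are not part of this chapter’s datasets.

    // Hypothetical example: a single string-typed field holding multiple tag values
    PUT tag_example/item/1
    {
      "name": "Deluxe Coffee Maker",
      "tags": ["Housewares", "Kitchen", "Breakfast"]
    }
    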

    Multi-valued fields work quite simply when the tag content is mapped as not_analyzed static tokens. In this case, a single match query will search the tokens with ease. Working with analyzed tags, however, is much harder. Let’s start by covering the simple case, indexing single token not_analyzed tags.

    We’ll be able to re-use the schema in figure 7.1. We’ll start by building a drilldown based on the department_name field, then move on to building a free-text search that can work with the department_name_analyzed field.

    7.3.2 Working With Non-Analyzed Tags

    Let’s start by looking at filtering using the department_name field, which has the not_analyzed option set. We’ll be loading a new dataset, products_multi_tagged.eloader, which is very similar to the products.eloader dataset used previously in this chapter. The main difference in this dataset is the addition of multiple tags per product. Let’s take a look at the new data by issuing the query in figure 7.7.

    F 7.7 Faceted Product Search With Multi-Tagged Data
    // Load Dataset: products_multi_tagged.eloader
    POST products_multi_tagged/_search?pretty=true
    {
      "query": {"match_all": {}},
      "facets":{
        "department_name": {"terms": {"field": "department_name"}}}
    }
    

    The documents matched by the query in figure 7.7 should look something like the one in figure 7.8. This is a single document from the query’s hits; the rest of the response has been omitted for brevity’s sake. Notice how multiple values have been set for the department_* fields.

    F 7.8 A Partial Result from Figure 7.7
    {
      "department_name": ["Housewares", "Kitchen", "Breakfast"],
      "department_name_analyzed": ["Housewares", "Kitchen", "Breakfast"],
      "department_id": 2,
      "name": "TCP Intl. Coffee Maker",
      "description": "Deluxe Coffee Maker, made by the trusted folks at TCP International.",
       "price": 4999
    }
    
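    In the facet section itself, every value of every matching document is counted, so a single multi-tagged product contributes to several term entries. The counts below are illustrative only; they depend on the rest of the dataset.

    // Abridged department_name facet from the response to figure 7.7 (counts illustrative)
    "department_name": {
      "_type": "terms",
      "terms": [
        {"term": "Housewares", "count": 2},
        {"term": "Kitchen", "count": 2},
        {"term": "Breakfast", "count": 1}]}
    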

    Searching our multi-valued data is an easy task. We can simply provide a bool filter composed of multiple term filters as must clauses. This ensures that we match only documents that have all of those tags. There’s a strong reason to use filters instead of queries in these cases: tags should not have any effect on the final score because we’re simply using them to restrict the dataset. Since similarity is based on calculations hinging on tf*idf scores, letting tags participate in scoring can lead to nonsensically scored results depending on the frequency of the filtered terms.

    To search for items in both “Housewares” and “Kitchen”, we would issue a query such as that in figure 7.9. This will correctly return the one item we have in both of these departments. Notice that the filter terms match the department names exactly. Alternatively, we could have used the department_id field which is also available, but we’ll stick to names to make these examples more clear.

    F 7.9 Multi-Faceted Product Search, Basic Filtering
    // Load Dataset: products_multi_tagged.eloader
    POST products_multi_tagged/_search?pretty=true
    {
      "query": {
        "filtered": {
          "filter": {
            "bool": {
           "should": [
             {"term": {"department_name": "Housewares"}},
             {"term": {"department_name": "Kitchen"}}
           ] }}}},
      "facets": {
        "department_name": {
          "terms": {"field": "department_name"}}}}
    

    7.3.3 Working With Analyzed Tags

    We’ve now seen how it’s possible to perform multi-tag facet drill-downs in the simple case where an exact term for each tag is known. For situations where we want tags considered as part of free-text search, things get trickier. In our previous examples we wanted to use our department_name field as a pure filter only. However, a search for “Kitchen TCP” should turn up the coffee maker by TCP International first, not books on networking. This is the primary use-case we’ll analyze going forward in this section.

    We’re going to solve this problem by using the position_offset_gap mapping option. If you examine the mapping for the products_multi_tagged index, you’ll notice that the mapping for department_name_analyzed has the setting "position_offset_gap": 1000. This option adjusts the position data stored alongside terms in Lucene indexes. A position_offset_gap setting of n artificially inflates the distance between the values of a multi-valued field by that amount; it is equivalent to concatenating all the values into a single value padded by n stop words. By switching our match query to the phrase type and setting the phrase slop to a number that is at least the number of terms in the longest value, but still smaller than n, we can isolate matches to individual values within the multi-valued field. Since calculating the exact bounds can be tricky, it is recommended to use an absurdly large gap.
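
    The relevant portion of the products_multi_tagged mapping looks roughly like this:

    // Excerpt of the department_name_analyzed field mapping in products_multi_tagged
    "department_name_analyzed": {
      "type": "string",
      "analyzer": "snowball",
      "position_offset_gap": 1000
    }
    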

    Let’s break down the query in figure 7.10. In this example we’ve changed quite a bit from figure 7.9. We’ve switched away from a filtered query, and have replaced it with a bool query.

    F 7.10 Using Analyzed Tags as a Factor in Search
    // Load Dataset: products_multi_tagged.eloader
    POST products_multi_tagged/_search?pretty=true
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "department_name_analyzed": {
                  "type": "phrase",
                  "slop": 10, 
                  "query": "kitchen tcp"}}},
            {"match": {"name": "kitchen tcp"}},
            {"match": {"description": "kitchen tcp"}} ]}},
      "facets": {
        "department_name": {
          "terms": {"field": "department_name"}}}}
    

    The intent behind this query is to handle free text input originating from a user facing search input. The user’s query is seen duplicated across all 3 clauses of the boolean query, a task normally done by application code dynamically building the query. A boolean query is used since we need to search across multiple fields with different options. Since each query in the should clause is a match query, and since multiple terms within a match query are matched with a logical ‘or’, we only need at least one term of the search phrase to match. In this case, at least one field must contain the term “kitchen” or “tcp”. Documents that match in more than one field will score higher. Documents in which both terms appear more frequently and/or closer together will score higher still.

    It is for these reasons that the top result is the “TCP Intl. Coffee Maker”. Its high score is due to its being tagged as “Kitchen” and containing the term “TCP” in its title. Note that the word “Kitchen” is analyzed, so the lowercase “kitchen” still matches. Other documents do match, such as the kitchen spatula and some TCP related networking products, but they do not rank as highly. This is the bare framework for implementing this kind of search; a full version would probably also include synonyms for “Kitchen” in the department_name_analyzed field, such as “Cooking”, “Food”, and “Culinary”, as sketched below.
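
    One way to add such synonyms (not part of this chapter’s datasets) is a custom analyzer built on the synonym token filter. The index, analyzer, and filter names below are made up purely for illustration:

    // Hypothetical index using a synonym-aware analyzer for department names
    PUT products_with_synonyms
    {
      "settings": {
        "analysis": {
          "filter": {
            "department_synonyms": {
              "type": "synonym",
              "synonyms": ["kitchen, cooking, food, culinary"]}},
          "analyzer": {
            "department_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "department_synonyms"]}}}},
      "mappings": {
        "product": {
          "properties": {
            "department_name_analyzed": {
              "type": "string",
              "analyzer": "department_analyzer",
              "position_offset_gap": 1000}}}}
    }
    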
