Chapter 7 Product Search with Drilldown
One of the most common use cases for elasticsearch is the ubiquitous query string with faceted drilldown, most frequently seen on e-commerce websites. Amazon is perhaps the best example of this. Let’s consider a search for the term “Network Routing”. After executing such a search, one is presented with a familiar interface: top product results on the right two thirds of the screen, and a left sidebar with ‘departments’ such as “Books” and “Electronics”, similar to what’s seen in the figure below.

Mapping this sort of UI to an elasticsearch query is a matter of generating two sets of results for a given query: hits and facets. Conveniently, elasticsearch lets you easily combine a regular query with faceting. The essential strategy we’ll employ is to combine a free text query for the main body of results with terms facets to populate the left-hand sidebar. When a user has clicked on a facet, thus restricting the search to that department alone, we will supply an additional filter against the ID of that department, properly restricting the search result. Modeling this type of search is straightforward, if not always intuitive.
7.1 Drilling Down One Level
In this section, we’ll learn how to drill down a single level. Multiple-level drilldown will be covered later in this chapter.
We’ll start by examining the schema for our fictional e-commerce site, seen in figure 7.1. Notice the fields for the department in our model. We’ll be using the department fields to create those left-hand facets. The department for a document is stored in three separate ways: once as an integer, which presumably points to a primary key in our primary datastore (perhaps an RDBMS); once as a not_analyzed token, so that it appears unmodified in facet aggregation, as we’ll see later; and finally as a snowball-analyzed field for natural language text matching. These choices have all been made to enable maximum flexibility while querying, as will be explained as we proceed through this chapter. The other fields are more straightforward, such as the snowball-analyzed “name” and “description” fields, and an integer “price” field.
// Load Dataset: products.eloader
GET products/_mapping?pretty=true
// Response should be as follows below
{
"products": {
"product": {
"properties": {
"department_id": {"type": "integer"},
"department_name": {"type": "string", "index": "not_analyzed"},
"department_name_analyzed": {"type": "string", "analyzer": "snowball"},
"description": {"type": "string", "analyzer": "snowball"},
"name": {"type": "string", "analyzer": "snowball"},
"price": {"type": "integer"} }}}}
We can get an idea of what our results will look like by issuing a simple faceted query, as in figure 7.2. This will return both a list of hits and a list of matching departments, with the count of documents matching each department included. The returned department_id and department_name facet results should indicate that three products in books and two in electronics match.
// Load Dataset: products.eloader
POST products/_search?pretty=true
{
"query": {
"match": {"_all": "network routing"}},
"facets":{
"department_id": {"terms": {"field": "department_id"}},
"department_name": {"terms": {"field": "department_name"}}}
}
Generating these queries in a real app is of course the application’s responsibility. It is for this reason that our schema includes the department_id field. In most real apps one would want to map facet names back to actual RDBMS or other authoritative records. While an un-tokenized not_analyzed name can also be used for lookups, using direct IDs is cleaner and can be more reliable.
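As a sketch of that application-side responsibility, the snippet below translates a terms facet response into sidebar entries by looking each department_id up in an authoritative store. The in-memory DEPARTMENTS dict stands in for an RDBMS lookup, and the function name is our own invention, not part of elasticsearch.

```python
# Hypothetical id -> canonical-name lookup, standing in for an RDBMS.
DEPARTMENTS = {1: "Books", 2: "Electronics"}

def sidebar_entries(facet_terms):
    """Map the 'terms' array of a terms facet response, e.g.
    [{"term": 1, "count": 3}, ...], to displayable sidebar entries."""
    entries = []
    for t in facet_terms:
        # Resolve the facet's department_id against the authoritative store.
        name = DEPARTMENTS.get(t["term"], "Unknown")
        entries.append({"id": t["term"], "name": name, "count": t["count"]})
    return entries
```

Given the facet results described above, `sidebar_entries([{"term": 1, "count": 3}, {"term": 2, "count": 2}])` yields Books (3) and Electronics (2) entries ready for rendering.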
The query we’ll use to drill down is a simple modification of the one in figure 7.2. We must simply add a query filter for the relevant department_id, as illustrated in figure 7.3. This filter must be applied at the query level, via either a constant_score or filtered query. Filters placed at other levels have unwanted effects, as described in the Filtering chapter.
// Load Dataset: products.eloader
POST products/_search?pretty=true
{
"query": {
"filtered": {
"query": {
"match": {"_all": "network routing"}},
"filter": {
"term": {"department_id": 1}}
}}}
Drilling down to a single category is a simple matter of applying a filter. The reason we used a filter rather than a query is that filters provide a performance boost over queries: they only restrict the result set and do not affect result scores. In the next section we’ll learn how to construct a more complex query that drills down multiple levels.
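The application code that toggles between the plain search and the drilled-down search of figure 7.3 might be sketched as follows. This is illustrative only; the function name and shape are assumptions, not part of any library.

```python
def build_search(text, department_id=None):
    """Return the plain match query, or, when a department has been
    selected, wrap it in a filtered query with a term filter."""
    match = {"match": {"_all": text}}
    if department_id is None:
        return {"query": match}
    return {
        "query": {
            "filtered": {
                "query": match,
                # The term filter restricts results without altering scores.
                "filter": {"term": {"department_id": department_id}},
            }
        }
    }
```

Calling `build_search("network routing", 1)` produces the body shown in figure 7.3.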
7.2 Drilling Down Multiple Levels
We’ve seen how to drill down a single level using our department facets; now let’s examine having multiple facets that can be independently or jointly selected. We’ll make it possible to drill down by both department and price. First, let’s adjust our facets to include price in their counts. To do this we’ll use a different kind of facet, a histogram facet. Since there are many distinct values for price, it makes sense to group similar prices into range-based buckets. The query in figure 7.4 facets prices in $50 increments. Please note that our prices are stored as integer cents, hence $50 is stored as 5000.
// Load Dataset: products.eloader
POST products/_search?pretty=true
{
"query": {
"match": {"_all": "network routing"}},
"facets":{
"department_id": {"terms": {"field": "department_id"}},
"department_name": {"terms": {"field": "department_name"}},
"price": {"histogram": {"field": "price", "interval": 5000}}} }
Our histogram facet returns a map of price ranges to the number of products falling within each range. This could be presented on the sidebar as a series of checkboxes, allowing users to select the price ranges they are most comfortable with. Our query for a single checkbox would be the same as in our single-level example, just with our additional facets. Since the facets are computed after the query is applied, the counts update to reflect the smaller result set after drilling down a level. Examine figure 7.5, which shows the facet counts after having drilled down a single level.
// Load Dataset: products.eloader
POST products/_search?pretty=true
{
"query": {
"filtered": {
"query": {
"match": {"_all": "network routing"}},
"filter": {
"and": [
{"range": {"price": {"gte": 0, "lte": 5000}}}
]} }},
"facets":{
"department_id": {"terms": {"field": "department_id"}},
"department_name": {"terms": {"field": "department_name"}},
"price": {"histogram": {"field": "price", "interval": 5000}}} }
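Since the histogram buckets above are keyed in integer cents, the application must convert a bucket key into a human-readable checkbox label. A minimal sketch, assuming the $50 interval used throughout this chapter:

```python
def price_range_label(key, interval=5000):
    """Format a histogram bucket key (in cents) as a dollar range,
    e.g. key 0 with a 5000-cent interval covers $0.00 to $50.00."""
    lo = key / 100          # cents -> dollars
    hi = (key + interval) / 100
    return "$%.2f-$%.2f" % (lo, hi)
```

For instance, `price_range_label(0)` returns "$0.00-$50.00", and `price_range_label(5000)` returns "$50.00-$100.00".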
Drilling down a second level works similarly. If a user were to select the $0-$50 range, and also select the books department only, we would issue the query in figure 7.6. The query should match exactly one product, Network Routing in Example Land, the only book about network routing that is less than $50, costing $49.90.
// Load Dataset: products.eloader
POST products/_search?pretty=true
{
"query": {
"filtered": {
"query": {
"match": {"_all": "network routing"}},
"filter": {
"and": [
{"term": {"department_id": 1}},
{"range": {"price": {"gte": 0, "lte": 5000}}}
]
} }
},
"facets":{
"department_id": {"terms": {"field": "department_id"}},
"department_name": {"terms": {"field": "department_name"}},
"price": {"histogram": {"field": "price", "interval": 5000}}} }
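A query like figure 7.6 is typically assembled by application code from whichever drilldown selections the user has made, each contributing one clause to the "and" filter. The sketch below is our own illustration of that composition, not a fixed API:

```python
def build_drilldown(text, department_id=None, price_range=None):
    """Compose an 'and' filter from the user's current selections;
    with no selections, fall back to the unfiltered match query."""
    clauses = []
    if department_id is not None:
        clauses.append({"term": {"department_id": department_id}})
    if price_range is not None:
        gte, lte = price_range  # integer cents, e.g. (0, 5000) for $0-$50
        clauses.append({"range": {"price": {"gte": gte, "lte": lte}}})
    match = {"match": {"_all": text}}
    if not clauses:
        return {"query": match}
    return {"query": {"filtered": {"query": match,
                                   "filter": {"and": clauses}}}}
```

Calling `build_drilldown("network routing", department_id=1, price_range=(0, 5000))` reproduces the two-level drilldown of figure 7.6.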
This chapter has been a tutorial in the basics of standard e-commerce searches with a facet drilldown. There are, however, many variations on this. One case we haven’t covered is searching tags, a topic covered in the next section.
7.3 Searching and Faceting Tags
7.3.1 Tag Drilldown and Faceting
Tags are a common and powerful tool for modeling data. Their flexibility makes them pervasive. While basic use cases for tags have simple implementations, providing richer query options can be quite tricky.
Tags generally have one of two search use cases: they are either used to filter, which is a straightforward task, or they are analyzed and matched against searches using the query, which is more complicated. We’ll cover both types of searches in this tutorial.
To understand our modeling for tags in elasticsearch, we’ll quickly recap multi-valued fields. Any field in an elasticsearch document can store either a single value or many, depending on whether the data is supplied as a single value or as an array. Lucene does not actually have a notion of arrays; it simply allows any field to hold multiple values, and elasticsearch builds its array handling on top of this. This is an ideal fit for tags, where a document may be tagged many times over: we can have one “tags” field in our document, typed as a string, and provide multiple values for it.
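Concretely, a multi-valued field is nothing more than a document body whose field holds an array. The document below is a sketch with illustrative field names, not a record from our dataset:

```python
# A document body with a multi-valued field: "tags" is simply an
# array, and elasticsearch indexes each element as a separate value
# of the same field.
doc = {
    "name": "Deluxe Spatula",
    "tags": ["Housewares", "Kitchen"],  # multiple values, one field
    "price": 499,                       # integer cents, as elsewhere
}
```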
Multi-valued fields work quite simply when the tag content is mapped as not_analyzed static tokens. In this case, a single match query will search the tokens with ease. Working with analyzed tags, however, is much harder. Let’s start by covering the simple case: indexing single-token not_analyzed tags.
We’ll be able to re-use the schema in figure 7.2. We’ll start by building a drilldown based on the department_name field, then move on to building a free-text search that can work with the department_name_analyzed field.
7.3.2 Working With Non-Analyzed Tags
Let’s start by looking at filtering using the department_name field, which has the not_analyzed option set. We’ll be loading a new dataset, products_multi_tagged.eloader, which is very similar to the products.eloader dataset used previously in this chapter. The main difference in this dataset is the addition of multiple tags per product. Let’s take a look at the new data by issuing the query in figure 7.7.
// Load Dataset: products_multi_tagged.eloader
POST products_multi_tagged/_search?pretty=true
{
"query": {"match_all": {}},
"facets":{
"department_name": {"terms": {"field": "department_name"}}}
}
The query in figure 7.7 should return documents like the one in figure 7.8. This is a single item from the array of hits; the rest of the response has been omitted for brevity’s sake. Notice how multiple values have been set for the department_* fields.
{
"department_name": ["Housewares", "Kitchen", "Breakfast"],
"department_name_analyzed": ["Housewares", "Kitchen", "Breakfast"],
"department_id": 2,
"name": "TCP Intl. Coffee Maker",
"description": "Deluxe Coffee Maker, made by the trusted folks at TCP International.",
"price": 4999
}
Searching our multi-valued data is an easy task. We can simply provide a boolean filter comprised of multiple term filters as must clauses. This will ensure that we match all documents that have exactly those tags. There’s a strong reason to use filters instead of queries in these cases: tags should not have any effect on the final score, because we’re simply using them to restrict the dataset. Since similarity is based on calculations hinging on tf-idf scores, querying on tags can lead to nonsensically scored results depending on the frequency of the filtered terms.
To search for items in both “Housewares” and “Kitchen”, we would issue a query such as that in figure 7.9. This will correctly return the one item we have in both of these departments. Notice that the filter terms match the department names exactly. Alternatively, we could have used the department_id field, which is also available, but we’ll stick to names to make these examples clearer.
// Load Dataset: products_multi_tagged.eloader
POST products_multi_tagged/_search?pretty=true
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{"term": {"department_name": "Housewares"}},
{"term": {"department_name": "Kitchen"}}
] }}}},
"facets": {
"department_name": {
"terms": {"field": "department_name"}}}}
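A query like figure 7.9 would normally be generated from whatever set of tags the user has checked. A minimal sketch of that generation, keeping the tags in a non-scoring filter as discussed above (function name is our own):

```python
def tag_filter_query(tags, field="department_name"):
    """Restrict a match_all to documents carrying every selected tag.
    Term filters keep the tags out of score calculations entirely."""
    clauses = [{"term": {field: t}} for t in tags]
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"bool": {"must": clauses}},
            }
        }
    }
```

`tag_filter_query(["Housewares", "Kitchen"])` produces a filter requiring both tags, matching the intent of figure 7.9.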
7.3.3 Working With Analyzed Tags
We’ve now seen how it’s possible to perform multi-tag facet drilldowns in the simple case where an exact term for each tag is known. For situations where we want tags considered as part of free-text search, things get trickier. In our previous examples we wanted to use our department_name field as a pure filter only. However, a search for “Kitchen TCP” should turn up the coffee maker by TCP International first, not books on networking. This is the primary use case we’ll analyze going forward in this section.
We’re going to solve this problem by using the position_offset_gap mapping option. If you examine the mapping for the products_multi_tagged index, you’ll notice that the mapping for department_name_analyzed has the setting "position_offset_gap": 1000. This option adjusts the term position data stored alongside terms in Lucene indexes. A position_offset_gap setting of n artificially inflates the distance between consecutive values in a multi-valued field by n positions; it is equivalent to concatenating all the values into a single value padded with n stop words between them. Altering our match query with the phrase option, and setting the phrase slop to a number smaller than n (but large enough to cover the longest single value), enables us to isolate matches to individual values within our multi-valued field. Since calculating that bound precisely can be tricky, it is recommended to use an absurdly large gap.
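To build intuition for why the gap isolates values, the toy model below assigns positions the way Lucene conceptually does for a multi-valued field: consecutive positions within a value, then a jump of the gap between values. This is our own simplified simulation, not elasticsearch code:

```python
def token_positions(values, gap=1000):
    """Assign a position to each whitespace token across the values
    of a multi-valued field, jumping by `gap` between values."""
    positions = {}
    pos = 0
    for value in values:
        for token in value.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # artificial distance separating consecutive values
    return positions
```

With `token_positions(["big kitchen", "tcp router"])`, "kitchen" and "tcp" end up over 1000 positions apart, so a phrase query with slop 10 cannot match "kitchen tcp" across the value boundary.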
Let’s break down the query in figure 7.10. In this example we’ve changed quite a bit from figure 7.9: we’ve switched away from a filtered query and replaced it with a bool query.
// Load Dataset: products_multi_tagged.eloader
POST products_multi_tagged/_search?pretty=true
{
"query": {
"bool": {
"should": [
{
"match": {
"department_name_analyzed": {
"type": "phrase",
"slop": 10,
"query": "kitchen tcp"}}},
{"match": {"name": "kitchen tcp"}},
{"match": {"description": "kitchen tcp"}} ]}},
"facets": {
"department_name": {
"terms": {"field": "department_name"}}}}
The intent behind this query is to handle free-text input originating from a user-facing search box. The user’s query is duplicated across all three clauses of the boolean query, a task normally done by application code dynamically building the query. A boolean query is used since we need to search across multiple fields with different options. Since each query in the should clause is a match query, and since multiple terms within a match query are matched with a logical ‘or’, we only need at least one term of the search phrase to match: in this case, at least one field must contain the term “kitchen” or “tcp”. Documents that match in more than one field will score higher. Documents with both terms present more frequently and/or closer together will score higher still.
It is for these reasons that the top result is the “TCP Intl. Coffee Maker”. Its high score is due to its being tagged as “Kitchen” and containing the term “TCP” in its title. Note that the word “Kitchen” is analyzed, as the lowercase “kitchen” still matches. Other documents match as well, such as the kitchen spatula and some TCP-related networking products, but they do not rank as highly. This is the bare framework for implementing this kind of search; a full version would probably also include synonyms for “Kitchen” in the department_name_analyzed field, such as “Cooking”, “Food”, and “Culinary”.
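The duplication of the user's input across the three clauses, described above as the application's job, can be sketched as follows. The slop of 10 mirrors figure 7.10 and must stay well below the position_offset_gap; the function name is illustrative.

```python
def free_text_query(user_input, slop=10):
    """Duplicate a raw user query string across the three should
    clauses of figure 7.10: a slop-limited phrase match on the
    analyzed tag field, plus plain matches on name and description."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"department_name_analyzed": {
                        "type": "phrase",
                        "slop": slop,  # keep well under the 1000 gap
                        "query": user_input}}},
                    {"match": {"name": user_input}},
                    {"match": {"description": user_input}},
                ]
            }
        }
    }
```

`free_text_query("kitchen tcp")` reproduces the query portion of figure 7.10; the facets would be appended by the same application code.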