Unfortunately, I no longer have the time to maintain this book, which is growing increasingly out of date (especially with the upcoming Elasticsearch 2.0). I highly recommend checking out Elasticsearch: The Definitive Guide instead. This site will remain up indefinitely to prevent link rot.

Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    3.1 The Search API

    The Search API is provided by the _search endpoint available both on /index and /index/type paths. An index search would be at /myidx/_search while a search scoped to a specific document type would be at /myidx/mytype/_search. The job of the Search API is to invoke a query with various parameters such as maximum result set size, result offset location, and a number of performance tuning options. The Search API also provides for both Faceting and Filtering, topics covered in subsequent chapters.

    Figure 3.1 shows a bare-bones invocation of the Search API. In the example only the size and query parameters are set; no facets or filters are applied. The query itself is quite simple, looking for matches for the term “skateboard” in the “hobbies” field. The _search endpoint works with both the GET and POST HTTP methods.

    F 3.1 Simple Query
    {
      "size": 3,
      "query": {
        "match": {"hobbies": "skateboard"}}}
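
    Since our hacker documents live in the planet index under the hacker type (as figure 3.2 below shows), the same body can also be posted to a type-scoped endpoint. A minimal variation on figure 3.1, not one of the book’s numbered figures:

    POST /planet/hacker/_search
    {
      "size": 3,
      "query": {
        "match": {"hobbies": "skateboard"}}}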
    

    The Query DSL is used to determine which documents match given criteria. It also ranks documents by their similarity, the technical term in Lucene for how closely a document matches a query. The similarity’s value is usually known as the document’s score. The Query DSL is supplied as the contents of the query key in the JSON posted to the _search endpoint, as in the example above.

    Let’s try this out by firing up our HTTP client and issuing the request from figure 3.1. The search should return the document for "gondry", the only person in our hacker database who skateboards. The results should look something like figure 3.2. Note that the document appears with an attached _score field, indicating on an arbitrary scale how well that document matched the query. Since none of the search terms appeared in our other documents, those documents are not present in the results and were not scored.

    F 3.2 Result of a Simple Query
    {
      "took": 2, "timed_out": false,
      "_shards": {"total": 5, "successful": 5, "failed": 0},
      "hits": {
        "total": 1, "max_score": 0.15342641,
        "hits": [
          {
            "_index": "planet", "_type": "hacker",
            "_id": "2", "_score": 0.15342641,
            "_source": {
              "handle": "gondry",
              "hobbies": ["writing reddit comments", "skateboarding"]}}]}}
    

    The match query in the previous example is about as simple as it gets in elasticsearch. Query types, however, are configurable. The match query, for instance, supports a number of options; it can be configured to search whole phrases by using a more verbose syntax, as in figure 3.3.

    F 3.3 Using a Match Query to Search for a Phrase
    // Load Dataset: hacker_planet.eloader
    POST /planet/_search
    {"query": {"match": {"hobbies": {"query": "writing reddit comments", "type": "phrase"}}}}
    // Matches gondry who does indeed like to write reddit comments
    

    Another type of query, quite different from the match query, is the fuzzy query. This query ranks results according to their Levenshtein distance from the search terms. Let’s perform a search with the fuzzy query per figure 3.4 and see what we get. We’ll search for the misspelled string "skateboarig" this time, to see how well the fuzzy query matches text that isn’t an exact token match.

    F 3.4 A Fuzzy Search
    // Load Dataset: hacker_planet.eloader
    POST /planet/_search
    {"query": {"fuzzy": {"hobbies": "skateboarig"}}}
    // Matches gondry who has "skateboarding" listed as a hobby.
    

    The queries illustrated in this section only begin to cover the options available for querying within elasticsearch, which has, at the time of writing, 36 different Query types and 25 different Filter types comprising the Query DSL.

    3.1.1 Mixing Queries, Filters, and Facets

    The Search API supports more than just the queries we’ve seen. It also supports filters, facets, sorting, routing to specific shards, and much more. A baroque example of the full Search API might look something like figure 3.5. Don’t worry if you don’t understand everything going on in this example, as most of its content is covered in later chapters. The important thing is to understand that the Search API broadly encompasses a range of features designed to get data out of elasticsearch. We’ll cover the particulars of their use in our later chapters on Faceting and Filtering.

    F 3.5 A Complex Search’s Skeleton
    // Load Dataset: hacker_planet.eloader
    POST /planet/_search
    {
      "from": 0,
      "size": 15,
      "query": {"match_all": {}},
      "sort": {"handle": "desc"},
      "filter": {"term": {"_all": "coding"}},
      "facets": {
        "hobbies": {
          "terms": {
            "field": "hobbies"}}}}
    

    We’re going to take a little break from our study of pulling data out to discuss analysis, the key to breaking down human language.

    3.2 Analysis

    3.2.1 What is Text Analysis?

    Analysis is the secret sauce in elasticsearch’s ability to deal with natural language and other complex data. Elasticsearch has a large toolbox with which we can slice and dice words so that they can be searched efficiently. Utilizing these tools we can narrow our search space and find common ground between linguistically similar terms.

    To understand analysis we first must understand that an elasticsearch index is much like a relational one in that efficient lookup is contingent on being able to traverse a tree-like structure looking for a precise match.

    Let’s start by looking at analyzing English-language text with the snowball analyzer. The snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root. Let’s see what happens to forms of the word ‘rollerblading’ when run through the snowball analyzer.

    [Figure: forms of “rollerblading” being reduced to the stem “rollerblad” by the snowball analyzer]

    It can be seen that “rollerblading” has been cut down to its stem, “rollerblad”, and that the same has happened to its cohorts “rollerblader” and “rollerbladed”. The Snowball analyzer is an example of a stemming analyzer, one where words are transformed into their root.

    When a search is performed on an analyzed field, the query itself is analyzed as well, matching it up against the documents, which were analyzed when they were added to the database. Reducing words to these short tokens normalizes the text, allowing for fast, efficient lookups. Whatever form of “rollerblading” you search for, internally we’re just looking for “rollerblad”.

    The next question is where and when analysis occurs. The answer is that it happens twice: once when the data is stored in the documents, and then a second time on each query as it comes in, according to the analysis rules of the fields it is matching against. The process by which documents are analyzed is as follows:

    1. A document update or create is received via a PUT or POST
    2. The field values in the document are each run through an analyzer which converts each value to zero, one, or more indexable tokens.
    3. The tokenized values are stored in an index, pointing back to the full version of the document.

    In this way an efficient inverted index is built up, allowing for exact matches to a query. Since all words are reduced to stemmed tokens, efficient exact and prefix matches can be made against an identically stemmed query. The reason both prefix and exact matches are efficient is identical to the reason those queries are efficient in SQL indexes; traversing values stored in trees is efficient for prefixes and exact matches only.

    3.2.2 The Analyze API

    The easiest way to see analysis in action is with the Analyze API, which lets you test pieces of text against any analyzer. To test the words “candles” and “candle” for instance, against a snowball analyzer, you would issue the query in Figure 3.6.

    F 3.6 Using the Analyze API
    GET '/_analyze?analyzer=snowball&text=candles%20candle&pretty=true'
    

    In this case we’ll be analyzing the string “candles candle” to illustrate how two similar words are analyzed. You should get the result in Figure 3.7.

    F 3.7 Analysis API Output
    {
      "tokens" : [ {
        "token" : "candl",
        "start_offset" : 0,
        "end_offset" : 7,
        "type" : "<ALPHANUM>",
        "position" : 1
      }, {
        "token" : "candl",
        "start_offset" : 8,
        "end_offset" : 14,
        "type" : "<ALPHANUM>",
        "position" : 2
      } ]
    }
    

    Note that both terms have been stemmed to the same root “candl”, and that metadata about offsets, type, and position has been generated. All of this information is encoded within the Lucene index for use while querying. The position offsets in particular are important, as they are used by phrase queries to determine word distances.

    The Analyze API is invaluable when trying to discern why some words tokenize identically and others don’t. If you’re stumped trying to fix a query that just won’t match, make sure you run some of the text through the Analyze API.

    3.2.3 About Custom Analyzers

    There may come a time when the default analyzers provided by elasticsearch are not sufficient for your data. Custom analyzers allow the slicing and dicing of text into specific token streams. They are called for in two situations: 1.) when an analyzer is configurable and non-default options are needed, and 2.) when alternate combinations of tokenizers and filters are needed. Before we dive in and build our own analyzers, let’s look at what analyzers are made of. An analyzer is really a three-stage pipeline composed of the following execution steps.

    • Character Filtering: Turns the input string into a different string
    • Tokenization: Turns the char-filtered string into an array of tokens
    • Token Filtering: Post-processes the filtered tokens into a mutated token array

    It should be noted that the filtering steps are optional; an analyzer is only required to turn a string into a list of tokens. The simplest form of analysis is the keyword analyzer, which is essentially the identity function of analyzers: it simply turns a single string into a single token. This is functionally equivalent to marking a field not_analyzed.
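
    A quick way to see this for yourself (not one of the book’s numbered figures) is to run some text through the keyword analyzer with the Analyze API; the entire input should come back as a single token:

    GET '/_analyze?analyzer=keyword&text=Bay%20Leaves&pretty=true'
    // Returns the entire input as one token: "Bay Leaves"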

    3.2.4 Building a CSV Analyzer

    Let’s dive in by building a custom analyzer for tokenizing CSV data. Our goal will be to turn a string of the form “Chicken, Salt, Pepper, Bay Leaves” into the tokens ["chicken", "salt", "pepper", "bay leaves"]. Our analyzer will consist of a tokenizer that splits the string on comma boundaries, followed by a token filter that trims whitespace from the ends of the tokens, followed by a final token filter that lower-cases the tokens. The pipeline for this custom analyzer is illustrated below.

    [Figure: the CSV analyzer pipeline, splitting “Chicken, Salt, Pepper, Bay Leaves” into the tokens “chicken”, “salt”, “pepper”, and “bay leaves”]

    Custom analyzers can be stored at the index level either during or after index creation. If defined after creation the index must first be closed before creating the analyzer. Let’s create a “recipes” index, close it, update the analysis settings, and reopen it in order to experiment with a custom analyzer. Follow the steps in figure 3.8.

    F 3.8 CSV Recipe Analyzer Tutorial
    // Create the index
    PUT /recipes
    
    // Close the index for settings update
    POST /recipes/_close
    
    // Create the analyzer
    PUT /recipes/_settings
    {
      "index": {
        "analysis": {
          "tokenizer": {
            "comma": {"type": "pattern", "pattern": ","}
          },
          "analyzer": {
            "recipe_csv": {
             "type": "custom",
             "tokenizer": "comma",
             "filter": ["trim", "lowercase"]}}}}}
    
    // Reopen the index
    POST /recipes/_open
    

    Now that the index and analyzer have been created, we can test the analyzer’s operation by using the _analyze API endpoint. By issuing the request in figure 3.9 the analyzer’s operation can be validated. The output should include four tokens, one for each comma-separated ingredient, just as the pipeline illustration above depicts.

    F 3.9 Checking the CSV Analyzer
    POST /recipes/_analyze?analyzer=recipe_csv
    Chicken, Salt, Pepper, Bay Leaves
    

    Now that we have our custom analyzer, we can test it on real documents. First, a mapping using the analyzer must be defined. Let’s define one per figure 3.10. Note that the custom analyzer has been referenced by name in the analyzer setting for the ingredients field.

    F 3.10 CSV Recipe Analyzer Mapping
    PUT /recipes/recipe/_mapping
    {
        "recipe": {
          "properties": {
             "name": {"type": "string", "analyzer": "default"},
             "ingredients": {"type": "string", "analyzer": "recipe_csv"}}}}
    

    Next, we’ll index some documents per figure 3.11, over which we’ll be able to run queries using our new analyzer.

    F 3.11 CSV Recipe Analyzer Documents
    POST /recipes/recipe/1
    {"name": "Chicken Piccata", "ingredients": "Chicken, Flour, Capers, Lemon, Parsley"}
    POST /recipes/recipe/2
    {"name": "Bolognese Sauce", "ingredients": "Pork, Beef, Tomatoes, Carrots, Celery, Onions, Bay Leaves"}
    POST /recipes/recipe/3
    {"name": "Caprese Salad", "ingredients": "Mozzarella, Basil, Tomatoes"}
    

    Now that our documents are indexed, we can run a search over them per figure 3.12. Note that the results include documents for recipes that contain either “tomatoes” or “mozzarella”. The match query by default executes an or of its tokens. Since elasticsearch queries are tokenized using the same analyzer as the field they’re searching, this results in a query that looks for either term. Document scores are generally highest when both terms are present. Accordingly, the caprese salad should be the first result, as it is the only recipe with both tomatoes and mozzarella.

    F 3.12 CSV Recipe Analyzer Search
    POST /recipes/recipe/_search
    {"query": {"match": {"ingredients": "Tomatoes, mozzarella"}}}
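
    If you want to require that every term be present, the match query also accepts an operator option. A minimal variation on figure 3.12 (not one of the book’s numbered figures) that should return only the caprese salad:

    POST /recipes/recipe/_search
    {"query": {"match": {"ingredients": {"query": "Tomatoes, mozzarella", "operator": "and"}}}}
    // Should match only the caprese salad, which contains both ingredients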
    

    3.3 Ranking Based on Similarity

    Now we get to the crux of the biscuit, what makes elasticsearch, you know… search? Let’s start by thinking of search as two distinct steps. The first step is matching all documents that meet the given criteria. This is known as boolean search in the field of Information Retrieval, because it only specifies which documents are in the result set, and which are excluded. The second step is scoring the documents based on similarity to the query in order to return the documents sorted by descending score. These two steps are the essence of search.

    Elasticsearch queries can be quite complex, especially when combined using the bool query type, or with filters (described later in this book). Multiple query types can be combined into a single query. These subqueries can have their scoring tuned as well, to better balance the scores of the various subqueries. While Elasticsearch queries are quite powerful, and can be rather complex, at the end of the day, executing a query boils down to the task of restricting the result set, scoring each document, then sorting based on that score.
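
    As a sketch of such a combination, the bool query below mixes two subqueries against the hacker dataset from earlier (the field names come from figure 3.2; the boost value is arbitrary, and this is not one of the book’s numbered figures):

    POST /planet/_search
    {
      "query": {
        "bool": {
          "should": [
            {"match": {"hobbies": {"query": "skateboard", "boost": 2}}},
            {"match": {"handle": "gondry"}}]}}}
    // The boost weights hobby matches more heavily than handle matches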

    This section seeks to cover, in broad strokes, the main algorithm governing the scoring of documents. There are multiple, configurable similarity algorithms available; we will describe the default, and most popular, scoring algorithm, implemented in Lucene’s TFIDFSimilarity class.

    A full accounting of the algorithm is beyond the scope of this book. If, however, you’re interested in the inner workings of the ranking algorithm, the Lucene documentation for TFIDFSimilarity is a good starting point. Understanding the algorithm in concept, however, is not too hard. From an execution standpoint Lucene’s general strategy is to first exclude all documents with no matches for the search terms, then rank the documents that do match. A document’s score will be higher when:

    • The matched term is ‘rare’, which is to say that it is found in fewer documents than other terms
    • The term appears in the document at a greater frequency than other terms within the document
    • Multiple terms are in the query and the document contains more of the query’s terms than other documents do
    • The field or document had a boost factor specified at index-time or query time

    The preceding list is an extreme simplification of how similarity is calculated. Please keep in mind that not all relationships between scoring factors are linear, and that there are a number of subtleties. Most of the time, however, Lucene does what you want without requiring developers to dig into the guts of TF/IDF. Additionally, Elasticsearch supports configurable similarity algorithms, such as the bm25 algorithm. If the prepackaged algorithms aren’t working for you, you can either write a script query, which dynamically scores documents according to whatever code you like, or even write a Java plugin implementing your own scoring algorithm.
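
    When a ranking surprises you, it can also help to ask elasticsearch to explain how a document’s score was computed. A minimal sketch (not one of the book’s numbered figures), reusing the query from figure 3.1 with the standard explain flag:

    POST /planet/_search
    {
      "explain": true,
      "query": {"match": {"hobbies": "skateboard"}}}
    // Each hit now includes an _explanation tree breaking down its score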

    Generally, however, TF/IDF, explicit sort orders (like date descending), and script queries, will get you to where you need to go.

    3.4 Faceting

    3.4.1 What is Faceting?

    Aggregate statistics are a core part of elasticsearch, and are exposed through the Search API. Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. Consider a user searching for movies by title. Using facets, you could provide aggregate counts of distinct genres within the result-set. If you’ve ever performed a search on an e-commerce site you’ve probably seen these exposed as drill-down options on a sidebar.

    Facets are highly configurable and are somewhat composable. In addition to counting distinct field values, facets can count by more complex groupings, such as spans of time, nested filters, and even full, nested elasticsearch queries!

    3.4.2 A Simple Faceting Example

    Let’s take a look at a simple faceting query. We’ll create a database of movies and return facets based on the movies’ genres alongside standard query results. To run these examples please load the movie_db.eloader data-set into your elasticsearch server.

    The schema we’ll use for our movies index is illustrated in figure 3.13. Since we’ll be faceting the genre field, we have disabled analysis via the "index": "not_analyzed" setting. Disabling analysis ensures the entire genre field is aggregated as a single token. We wouldn’t want the genre “Romance” transformed into “romanc”, for instance, or “Science Fiction” to be aggregated as two separate categories “science” and “fiction”.

    F 3.13 Simple Movie Mapping
    // Load Dataset: movie_db.eloader
    GET /movie_db/movie/_mapping?pretty=true
    {
      "movie": {
        "properties": {
          "actors": {"type": "string", "analyzer": "standard",
                     "position_offset_gap": 100},
          "genre": {"type": "string", "index": "not_analyzed"},
          "release_year": {"type": "integer", "index": "not_analyzed"},
          "title": {"type": "string", "analyzer": "snowball"},
          "description": {"type": "string", "analyzer": "snowball"}}}}
    

    Let’s take a look at faceting in action by running figure 3.14. This query searches for movies with a description containing “hacking”. In addition to returning a list of movies matching those criteria, the query returns aggregate counts showing which genres contain films whose descriptions match “hacking”, and how many such films fall into each genre.

    F 3.14 Simple Terms Faceting
    // Load Dataset: movie_db.eloader
    POST /movie_db/_search
    {"query": {"match": {"description": "hacking"}},
     "aggs": {
       "genre": {
         "terms": {"field": "genre"}}}}
    

    After running the query, you should see that the standard results, under hits, contain the movies “Hackers” and “Swordfish”. You should also see, under the aggregations section, data similar to figure 3.15 (which omits the individual hits to focus on the aggregation output).

    F 3.15 Simple Terms Facet Results
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "genre": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "Action",
              "doc_count": 2
            },
            {
              "key": "Crime",
              "doc_count": 2
            },
            {
              "key": "Drama",
              "doc_count": 1
            }
          ]
        }
      }
    }
    

    Notice how the counts for each genre equal the number of times those genres appear in the result-set. Additionally, you can see that the buckets are sorted by frequency: the most common genres come first. In a scenario where the genre list is very long, you can control how many buckets are returned with the size option, as shown below.
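
    For instance, a minimal variation on figure 3.14 (not one of the book’s numbered figures) that limits the response to the two most frequent genres:

    POST /movie_db/_search
    {"query": {"match": {"description": "hacking"}},
     "aggs": {
       "genre": {
         "terms": {"field": "genre", "size": 2}}}}
    // Returns only the two most frequent genre buckets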

    3.5 Filtering

    Using filters effectively is every bit as important as using queries effectively in elasticsearch. They deserve special attention, in fact, because they have a drastic effect on the execution path of a search. While queries describe which documents appear in results and how they are to be scored, filters only describe which documents appear in results. This can result in a significantly faster query. Additionally, some criteria can only be specified via a filter; no query equivalent exists. Filters can be used to optimize a query by efficiently cutting down the result set without executing relatively expensive scoring calculations. Filters may also be used in the case where a term must be matched against, but its contribution to the document’s overall score should be a fixed amount regardless of the TF/IDF score. Additionally, unlike queries, filters may be cached, leading to significant performance gains when repeatedly invoked.

    Elasticsearch exposes filters in three different ways, which can be somewhat confusing. The three forms control whether the filter is applied to both the query and facets, to the query only, or to a single facet. A disambiguating list is presented below.

    • Queries of the filtered/constant_score type: These are both nested in the query field. Filters here will affect both query results and facet counts.
    • The top-level filter element: Specifying a filter at the root of the search, alongside the query and facet elements, will filter only the query, but not affect facets.
    • Facets with the facet_filter option: Each facet supports an optional facet_filter element, which can be used to pre-filter data before it is aggregated. This filtering will only affect the facet it is defined in, and will not affect query results.

    The next section of this book will explain each type in detail.

    3.5.1 The Three Different Stages of Filtering

    Using the filtered and constant_score queries

    Most use cases call for a filter of both facets and query results. For these use cases one must construct a search using either the filtered or constant_score queries. Both types allow for a regular query to be nested inside. The filter runs first, followed by the query and any facets, potentially providing a healthy speed boost if the filters restrict the result-set enough. An example of this can be seen in figure 3.16.

    F 3.16 Using a filtered Query
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {"department": {"terms": {"field": "department_name"}}},
      "query": {
        "filtered": {
          "query": {"match": {"name": "fake"}},
          "filter": {"term": {"department_name": "Books"}}}}}
    

    The filtered query takes two arguments: a query, which is an arbitrary nested query, and a filter to apply before the query part executes. The filter executes first, reducing the result-set significantly before the query executes.

    The constant_score query is similar to the filtered query, except that it takes no nested query argument. Instead, a constant_score query takes a boost argument, which becomes the score for every returned document when the query is combined with other queries. The score will always come out as 1 when a constant_score query is executed on its own, as in figure 3.17, since there is no utility in assigning a different score when every document gets the same number. If, however, a constant_score query is combined with another query, the boost comes into play and is used to determine the document’s score. An example of setting the boost to weight two different constant_score queries can be seen in figure 3.18.

    F 3.17 Using a constant_score Query
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {"department": {"terms": {"field": "department_name"}}},
      "query": {
        "constant_score": {
          "boost": 1.5,
          "filter": {"term": {"department_name": "Books"}}}}}
    
    F 3.18 Combining constant_score Queries
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {"department": {"terms": {"field": "department_name"}}},
      "query": {
        "bool": {
          "should": [
            {
              "constant_score": {
                "boost": 2,
                "filter": {"term": {"department_name": "Housewares"}}}},
            {
              "constant_score": {
                "boost": 3,
                "filter": {"term": {"department_name": "Books"}}}}]}}}
    

    When the constant_score query is used with only a filter, it makes for a very fast query since there is no secondary query phase. The catch here is that there’s no way to actually determine a score for the documents, hence the assignment of a fixed score to each document returned via the constant_score query’s boost field. An example can be seen in figure 3.17.

    It should also be noted that a constant_score query can take a query as an argument, in which case the scoring portion of the query will be ignored, and the query is used only as a filter.
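
    For example, a minimal sketch (not one of the book’s numbered figures) of a constant_score query wrapping a regular query, whose scores are then discarded:

    // Load Dataset: products.eloader
    POST /products/_search
    {
      "query": {
        "constant_score": {
          "query": {"match": {"description": "tcp"}}}}}
    // Matching documents all receive the same constant score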

    3.5.2 Using the filter element

    The filter element is perhaps the most confusing API in elasticsearch. This confusion arises because constant_score and filtered queries are nested inside the top-level query element, yet affect both queries and facets, while the filter element is located at the root of a search, as a sibling to the query and facets elements, yet only affects the query. It is important to keep this in mind, as filtering only the query can cause a large number of documents to be faceted unintentionally, with dire performance consequences.

    The content of the filter field is identical to the filter field of the previously seen constant_score and filtered queries; it simply lives at the root of the request’s JSON. An illustration of this can be seen in figure 3.19. Notice how the facets in the results indicate that there are a total of two matching documents, one in “Electronics”, the other in “Books”, yet the hits field only shows results from the “Books” department as specified by the filter.

    F 3.19 Using the filter Element
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {"department": {"terms": {"field": "department_name"}}},
      "query": {"match": {"description": "tcp"}},
      "filter": {"term": {"department_name": "Books"}}}
    

    Filtering Facets with a facet_filter

    Facet filters are the opposite of the filter element: they only filter a single facet, and do not affect query results at all. Note that this is quite different from a filter facet, which is discussed in the next section. A facet with a facet_filter aggregates like a normal facet; it simply ignores documents that don’t match the filter. An example of a facet filter can be seen in figure 3.20.

    F 3.20 Using a facet_filter
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {
        "department": {
          "terms": {"field": "department_name"},
           "facet_filter": {
             "term": {"department_name": "Books"}}}
      },
      "query": {"match_all": {}}}
    

    The same performance warnings go for a facet_filter as for the top-level filter element. By only filtering facets, there is potential for the query to match a very large number of documents, making for a potentially expensive search.

    Filter Facets

    Once again, the filter API proves confusing. The previous section discussed the use of a facet_filter. In this section, we will cover the use of filter facets, which are entirely different. A filter facet is used to count the number of documents matching both the search query and an optional filter. An example can be seen in figure 3.21. An easy way to remember when to use filter facets vs facet filters is to remember that filter facets always return a single count, whereas a facet_filter can return multiple counts for different terms matching a filter.

    F 3.21 Using a Filter Facet
    // Load Dataset: products.eloader
    POST /products/_search
    {
      "facets": {
        "books_and_housewares": {
          "filter": {
            "terms": {"department_name": ["Books", "Housewares"]}}}},
      "query": {"match_all": {}}}
    

    3.5.3 Building Filters

    Up to this point, we’ve only used the simplest of the filters, the term and terms filters, which are essentially identical in concept to the match query, with the notable difference that the analysis phase is skipped. Filters can operate on many different criteria, including such operations as numerical ranges, inclusion in a geographic area, matching various textual criteria, and compositions of other filters. Filters can be combined using boolean expressions, just like queries.
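
    As a rough sketch of combining filters (not one of the book’s numbered figures), the bool filter below requires both a term filter and a range filter to match. Note that the price field is purely illustrative and may not exist in the products dataset:

    // Note: the "price" field below is hypothetical; substitute a numeric field from your own data
    POST /products/_search
    {
      "query": {
        "filtered": {
          "query": {"match_all": {}},
          "filter": {
            "bool": {
              "must": [
                {"term": {"department_name": "Books"}},
                {"range": {"price": {"gte": 10, "lte": 50}}}]}}}}}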

    Subsequent chapters in this book model real-world problems, and in them there will be examples of more complex filter usage.
