Unfortunately, I no longer have the time to maintain this book, which is growing increasingly out of date (especially with the upcoming Elasticsearch 2.0). I highly recommend checking out Elasticsearch: The Definitive Guide instead. This site will remain up indefinitely to prevent link rot.

Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    4.1 Getting Started

    This section explores a common use-case for elasticsearch, searching large bodies of text, such as blog posts, magazine articles, and books. We’ll illustrate natural language searches via Charles Darwin’s famous book on natural selection: On the Origin of Species. This book has been made available for our exercises through the excellent Project Gutenberg. Let’s take a quick look at the schema by running the commands in figure 4.1 after loading the proper dataset.

    F 4.1 On the Origin of Species Schema
    // Load Dataset: darwin-origin.eloader
    GET /darwin-origin/_mapping?pretty=true
    
    // Should return:
    {
      "darwin-origin": {
        "chapter": {
          "properties": {
            "numeral": {"type": "string"},
            "title": {"type": "string"},
            "text": {"type": "string"} }}}}
    

    There are three fields in the mapping, holding the chapter number as a roman numeral, the chapter’s title, and finally a much larger field containing the chapter’s full text.

    Next, we’ll issue a simple query against this data to become familiar with our indexed documents. Each chapter is many pages long, so including the text field would make the query output cumbersome to read. To keep the output small, we’ll use the Search API’s fields option to omit the text field from the returned documents. Run figure 4.2 to return a list of all the chapters in the book, with only their ids, numerals, and titles.

    F 4.2 Matching Limited Fields
    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
        "match_all": {}
      },
      "fields": ["_id", "title", "numeral"]
    }
    

    The output of figure 4.2 should include all XIII (13) chapter numerals, titles, and the document ID via the special _id field. It is, in effect, a chapter listing.

    Now let’s see if we can find which chapter documents contain references to the famous ship that carried Darwin on his voyage, H.M.S. Beagle. Due to the large size of our documents, we’ll want the result to show us only the relevant parts of each chapter’s text, near where the word “beagle” appears. For this task we’ll use elasticsearch’s Highlighting API. Try executing the query in figure 4.3 to see our highlighting search for the Beagle.

    F 4.3 Searching for the Beagle
    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
       "match": {"text": "beagle"}
      },
      "fields": ["numeral", "title"],
      "highlight": {
        "fields": {"text": {"number_of_fragments": 3}}
      }
    }
    

    Astonishingly, Darwin leaves the Beagle out of most of the book, referencing it by name only once, in Chapter XIII. Looking at our results, we see that Chapter XIII is the only document returned, with the relevant section conveniently highlighted.

    4.2 Searching More Precisely

    The match query we’ve been using can do more than search for single terms like “beagle”. If, for instance, we wanted to find sections of the book that talk about small mammals, we could simply issue an identical query with “small mammals”. It is tempting to expect elasticsearch’s match query to find text where these terms are nearby, scoring sections of text highest where the terms are closest. Figure 4.4 shows what actually happens.

    F 4.4 Searching for Two Terms (Poorly)
    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
       "match": {"text": "small mammals"}
      },
      "fields": ["numeral", "title"],
      "highlight": {
        "fields": {"text": {"number_of_fragments": 3}}
      }
    }
    

    The first result’s first highlight should be “, for an existing crocodile is associated with many lost <em>mammals</em> and reptiles in the sub-Himalayan deposits.” This is not a good match. The text actually contains the exact phrase “small mammals”, but that phrase does not appear in any of the highlights. The reason is that our query tells elasticsearch to find the documents which contain the words “small” or “mammals” most frequently, then highlight sections with high scores for both. It is possible to make elasticsearch require that both terms be present by specifying that the match query use an and operator rather than the default or, as in figure 4.5, but that won’t guarantee that the terms are nearby. What we actually want is a search for the phrase “small mammals”.

    F 4.5 Searching for Two Terms with an "And"
    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
       "match": {"text": {"query": "small mammals", "operator": "and"}}
      },
      "fields": ["numeral", "title"],
      "highlight": {
        "fields": {"text": {"number_of_fragments": 3}}
      }
    }
    

    To search for phrases, use the aptly named match_phrase query, which elasticsearch runs as a Lucene PhraseQuery. Converting our previous query to a search for a whole phrase is quite simple, as seen in figure 4.6. After issuing the phrase query, you’ll notice that our results contain highlighted portions of the text that actually match the full phrase “small mammals”.

    F 4.6 Searching for a Phrase
    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
       "match_phrase": {"text": "small mammals"}
      },
      "fields": ["numeral", "title"],
      "highlight": {
        "fields": {"text": {"number_of_fragments": 3}}
      }
    }
    

    Phrase queries can be quite a bit more complex than figure 4.6. One common adjustment is setting a slop value. The slop parameter sets the largest edit distance allowed between the query phrase and the matching text. This is a Levenshtein-style distance, but it is counted in moves of whole terms out of position rather than in single characters. Furthermore, the larger this distance, the lower the score will be. The most important thing to understand, perhaps, is that the phrase “small mammals” is 1 edit away from “small furry mammals”, but is 2 edits away from “mammals small”. Readers not familiar with the basics of the Levenshtein Distance algorithm are highly encouraged to skim its Wikipedia page.
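
    As a minimal sketch of this adjustment, the query from figure 4.6 can be given a slop by switching match_phrase to its expanded form, the same form figure 4.5 uses to set an operator. The value 2 is only an illustrative assumption: going by the edit counts above, it should be just enough to admit both “small furry mammals” and “mammals small”. Which passages it actually matches in this dataset is left for the reader to explore.

    // Load Dataset: darwin-origin.eloader
    POST darwin-origin/chapter/_search?pretty=true
    {
      "query": {
       "match_phrase": {"text": {"query": "small mammals", "slop": 2}}
      },
      "fields": ["numeral", "title"],
      "highlight": {
        "fields": {"text": {"number_of_fragments": 3}}
      }
    }

    Lowering slop to 1 should, by the same counts, still allow one intervening word but no longer allow the two terms to appear in reversed order.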
