
    5.1 Formulating an Approach

    Many search applications involve searching text that standard language analyzers cannot handle. Examples include usernames (“OoKissezoO”, “UltraCaesar52”), product and brand names (“Sony rx100”, “Directv”), and gibberish like log data. This sort of data requires one of two strategies: 1.) find or build a suitable analyzer for the domain, or 2.) index the data without analysis, treating the text as opaque tokens. The second strategy is quite straightforward, as it merely requires setting "index": "not_analyzed" in the mapping for that field, so this chapter will primarily discuss using and building analyzers.
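
    If you take the second strategy, the mapping change is small. The sketch below shows what such a mapping might look like; the index, type, and field names (example, doc, code) are made up for illustration, following the Elasticsearch 1.x mapping API:

    PUT /example/_mapping/doc
    {
      "doc": {
        "properties": {
          "code": {"type": "string", "index": "not_analyzed"}
        }
      }
    }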

    A good starting point for this topic is the common problem of searching usernames. Usernames often defy stemming analysis; there is no appropriate stem for the username “OoKissezoO”. The simplest way to deal with this sort of data is to skip analysis altogether, which at least makes exact matches work. This is a fine, and possibly even desirable, strategy for many cases. Expert users, for instance, may only want to retrieve exact matches. But what about finding similar usernames, handling typos, and serving people merely searching for ‘kissez’? To do that we’ll try out a couple of different approaches.

    The schema for our username database will employ Elasticsearch’s fields option in its mapping, allowing us to store our username data once but index it twice using multiple analyzers. It is fairly common to want to store a piece of data only once, yet analyze it multiple ways for different types of searches; the fields option accomplishes exactly that.

    We’ll index the data once with minimal analysis as username.exact, using Elasticsearch’s simple analyzer, which lowercases text and splits it on non-letter characters. We’ll index the username data a second time as username.ngram, using a custom ngram analyzer. We’ll also store the user’s real name, analyzed with a simple whitespace analyzer. The mappings and analyzer settings should look something like figure 5.1.

    F 5.1 Username Mappings & Settings
    GET /users/_settings
    //result
    {
      "users": {
        "settings": {
          "index": {
            "uuid": "SuuLL6iTTU2jYDlRW6jrQg",
            "analysis": {
              "analyzer": {
                "usernameNgram": {
                  "type": "custom",
                  "filter": "lowercase",
                  "tokenizer": "customNgram"
                }
              },
              "tokenizer": {
                "customNgram": {
                  "min_gram": "3",
                  "type": "nGram",
                  "max_gram": "5"
                }
              }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "version": {"created": "1000199"}
          }}}}
    
    // Load Dataset: usernames.eloader
    GET /users/_mapping
    //result
    {
       "users": {
          "mappings": {
             "user": {
                "properties": {
                   "real_name": {"type": "string"},
                   "username": {
                      "type": "string",
                      "fields": {
                         "exact": {"type": "string", "analyzer": "simple"},
                         "ngram": {"type": "string", "analyzer": "usernameNgram"}
                      }}}}}}}
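
    Both figures above show the index as Elasticsearch reports it back. For reference, an index like this could be created in a single request, roughly as sketched below; this is a minimal sketch following the 1.x create-index API, simply mirroring the settings and mappings shown above:

    PUT /users
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "usernameNgram": {
              "type": "custom",
              "tokenizer": "customNgram",
              "filter": ["lowercase"]
            }
          },
          "tokenizer": {
            "customNgram": {"type": "nGram", "min_gram": 3, "max_gram": 5}
          }
        }
      },
      "mappings": {
        "user": {
          "properties": {
            "real_name": {"type": "string"},
            "username": {
              "type": "string",
              "fields": {
                "exact": {"type": "string", "analyzer": "simple"},
                "ngram": {"type": "string", "analyzer": "usernameNgram"}
              }
            }
          }
        }
      }
    }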
    

    5.2 A Simple Username Search Using Ngrams

    As described earlier, one of our main concerns for a search over usernames is handling misspellings. Elasticsearch’s ngram analyzer gives us a solid base for searching usernames. An ngram analyzer splits text up into a sliding window of contiguous letter sequences; these are substrings, not permutations of letters. We can learn a bit more about ngrams by feeding a piece of text straight into the analyze API. We’ve defined a custom ngram analyzer on our index, set to emit all possible 3, 4, and 5 letter substrings of the text.

    Let’s test it against the username “cmdrtaco”, as in figure 5.2. Notice how we’re referencing the analyzer usernameNgram defined in the settings portion of figure 5.1. Since we’re using an index-specific analyzer, we must use the _analyze endpoint for that particular index, at /users/_analyze.

    F 5.2 Analyzing a Username with the Ngram Analyzer
    // Load Dataset: usernames.eloader
    GET users/_analyze?text=cmdrtaco&analyzer=usernameNgram
    //result
    // Output should look like what you see below, but longer
    {
      "tokens": [
        {"token": "cmd", "start_offset": 0, "end_offset": 3, "type": "word", "position": 1},
        {"token": "cmdr", "start_offset": 0, "end_offset": 4, "type": "word", "position": 2},
        {"token": "cmdrt", "start_offset": 0, "end_offset": 5, "type": "word", "position": 3},
        {"token": "mdr", "start_offset": 1, "end_offset": 4, "type": "word", "position": 4},
        {"token": "mdrt", "start_offset": 1, "end_offset": 5, "type": "word", "position": 5},
        ...]}
    

    We can couple this ngram-indexed data with a simple match query to find users, as seen in figure 5.3. The default for a match query is to combine the terms produced by analysis (of which there are many when using ngrams) as a boolean or, so we’ll see results that don’t always match exactly, but are ranked with the best match first. Ngrams can produce stunningly poor quality results, but they also tend not to waste much of a user’s time, since the top result usually makes the quality of the result set obvious.

    F 5.3 A Simple Ngram Based Search
    // Load Dataset: usernames.eloader
    POST users/user/_search
    {"query": {"match": {"username.ngram": "linus"}}}
    

    5.2.1 Why Ngrams are a Good Choice for Usernames

    From a results perspective, it is advantageous that ngrams don’t attempt language heuristics such as stemming, which don’t apply to strings like “cmdrtaco”. They also handle misspelled terms well, since a query needs only a plurality of matching sub-parts of a given term. Contrast this with the snowball analyzer, which stems each word down to a single term, meaning only exact matches on that stem will work; even configured synonyms won’t rescue a misspelling.
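
    You can check this with the analyze API, using the same pattern as figure 5.2. Running the built-in snowball analyzer over a regular English word versus a username illustrates the contrast; the comments note what snowball stemming should produce:

    GET users/_analyze?text=running&analyzer=snowball
    // should yield the single stemmed token "run"
    GET users/_analyze?text=cmdrtaco&analyzer=snowball
    // should yield the single token "cmdrtaco", untouched by stemming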

    When it comes to performance, ngrams let developers trade storage for CPU when compared to a fuzzy query, another option for these sorts of searches. A fuzzy query finds misspelled terms by calculating the Levenshtein distance between the query term and the terms in the index. Ngrams, on the other hand, use a generally more performant strategy based on storing a large number of “gram” tokens in the index: match queries over ngrams are fast because they are essentially big boolean should clauses composed of exact hits against specific grams. Contrast this with the fuzzy query, which must scan the terms dictionary and compute the Levenshtein distance to the query term. That said, fuzzy queries can be quite useful for datasets where the total number of terms is low, or where their prefix optimization can be employed without significantly degrading results.
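
    For comparison, a fuzzy query against the exact field might look like the sketch below. The fuzziness and prefix_length values here are illustrative guesses, not recommendations; prefix_length is the prefix optimization mentioned above, requiring the first character to match exactly so that far fewer terms need to be scanned:

    POST users/user/_search
    {
      "query": {
        "fuzzy": {
          "username.exact": {
            "value": "linis",
            "fuzziness": 2,
            "prefix_length": 1
          }
        }
      }
    }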
