
    8.1 Routing

    8.1.1 How Elasticsearch Routing Works

    Understanding routing is important in large elasticsearch clusters. By exercising fine-grained control over routing, the amount of cluster resources an operation consumes can be reduced dramatically, often by orders of magnitude.

    The primary mechanism through which elasticsearch scales is sharding. Sharding is a common technique for splitting data and computation across multiple servers: a function is applied to a property of each document, returning a consistent value that determines which server the document will be stored on. By default, elasticsearch uses the document’s _id field for this purpose. The algorithm used to convert a value to a shard id is what’s known as a consistent hashing algorithm; conceptually, the target shard is chosen by hashing the routing value and taking the result modulo the number of primary shards.

    Maintaining good cluster performance is contingent upon even shard balancing. If data is unevenly distributed across a cluster, some machines will be over-utilized while others will remain mostly idle. To avoid this, we want as even a distribution of numbers coming out of our consistent hashing algorithm as possible. Document ids generally hash well because they are evenly distributed, whether they are UUIDs or monotonically increasing ids (1, 2, 3, 4 …).

    This is the default approach, and it generally works well as it solves the problem of evening out data across the cluster. It also means that fetches for a single document only need to be routed to the shard that document hashes to. But what about routing queries? If, for instance, we are storing user history in elasticsearch and using UUIDs for each piece of user history data, user data will be stored evenly across the cluster. There is some waste here, however: searches for a given user’s data have poor data locality. Each query must be run on every shard in the index and executed against all possible data. Assuming that we have many users, we can likely improve query performance by consistently routing all of a given user’s data to a single shard. Once the user’s data has been segmented in this way, operations on that user’s data only need to execute against a single shard.

    8.1.2 Implementing Custom Routing

    Specifying routing information in elasticsearch is fairly straightforward. There are two strategies that may be employed: 1) routing based on the value of a document field, or 2) routing based on a value passed at index time via the routing query parameter. Queries must always include the routing query parameter to take advantage of explicit routing. We’ll start with the first approach, as it is simpler.
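
    Before diving in, here is a quick sketch of what the second strategy looks like in practice. The index, type, and field names below are hypothetical; the routing query parameter is the point of interest.

    // Index a document, routing it by an explicitly supplied value
    PUT /anindex/atype/1?routing=42
    {"user_id": 42, "text": "routed via the query parameter"}

    // The same routing value should be supplied when retrieving it
    GET /anindex/atype/1?routing=42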

    An important best practice is to ensure that elasticsearch’s default routing is disabled for our index. Once custom routing is in play it should be used exclusively in order to preserve developer sanity. Disabling the default ID based routing can be achieved through the mapping API and will be enforced in subsequent examples.

    Let’s use a micro-blogging platform as our example in this exercise. We’ll call this fictional microblogging platform microblagh. The case we will optimize for is searches scoped to a single user. By routing based on the user_id value, we’ll be able to make searches limited to a single user’s posts much faster. To start, we must define an index and a mapping type post with _routing.required set to true. We’ll also specify _routing.path, which tells elasticsearch which field value to look at for routing. Run the code in figure 8.1 to get started. With this setting enabled, attempts to index documents without an explicitly specified routing value will fail with an error.

    F 8.1 Schema With Routing Enforcement
    POST /microblagh
    {"mappings":
      {"post":
        {"_routing": {"required": true, "path": "user_id"},
         "properties": {
           "user_id": {"type": "integer"},
           "username": {"type": "string"},
           "text": {"type": "string"}}}}}
    

    Creating documents in our schema is straightforward enough, and nothing special need be done. In order to test our routed index, let’s populate it with a few sample documents per figure 8.2.

    F 8.2 Sample Data for Microblagh
    POST /_bulk
    {"index": {"_index": "microblagh", "_type": "post", "_id": 1}
    {"user_id": 1, "username": "albert", "text": "all work, and no play"}
    {"index": {"_index": "microblagh", "_type": "post", "_id": 2}
    {"user_id": 1, "username": "albert", "text": "make jack a dull boy"}
    {"index": {"_index": "microblagh", "_type": "post", "_id": 3}
    {"user_id": 2, "username": "john", "text": "Microblagh 4eva"}
    {"index": {"_index": "microblagh", "_type": "post", "_id": 4}
    {"user_id": 1, "username": "stacy", "text": "Microblagh is overrated"}
    

    We can test that documents have been properly routed by running operations with the routing query parameter set. For instance, post id 3 should be routed based on its user_id value of 2. To retrieve it, we could run GET /microblagh/post/3. To route the request only to the shard the document actually lives on, we can instead run GET /microblagh/post/3?routing=2. Finally, we can verify the properties of routing by running GET /microblagh/post/3?routing=1, which should return a 404 error, since the incorrect routing value sends the request to a shard that does not hold the document.
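
    These document-level checks can be run as-is; the comments below note the expected outcomes.

    // Fetched by id alone
    GET /microblagh/post/3

    // Routed directly to the shard holding user 2's data
    GET /microblagh/post/3?routing=2

    // Wrong routing value: the request goes to the wrong shard and returns a 404
    GET /microblagh/post/3?routing=1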

    The same effects can be seen with queries as well. Searching for the text Microblagh across the entire index should return documents 3 and 4, by john and stacy respectively. Since they have different routing values, however, only one is returned when a single routing value is set, as illustrated in figure 8.3.

    F 8.3 Searching with Routing
    // Returns posts 3 and 4
    POST /microblagh/post/_search
    {"query": {"match": {"text": "Microblagh"}}}
    
    // Only returns 3
    POST /microblagh/post/_search?routing=2
    {"query": {"match": {"text": "Microblagh"}}}
    
    // Returns both once again
    POST /microblagh/post/_search?routing=2,3
    {"query": {"match": {"text": "Microblagh"}}}
    

    8.1.3 Performance Concerns

    It should be noted that to search a single user’s posts, we would still need a filter on our query scoping it to that user specifically, as sketched below. Routing should not be used to limit the logical results of a query, since changing the number of underlying shards will change which data is visible for a given routing value. The cardinality of routing values should generally be a large multiple of the number of shards so that data can be balanced. This helps ensure that both data and computation are distributed evenly across the cluster.
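
    A single-user search might therefore combine the routing parameter with an explicit filter on user_id, roughly as follows (the filtered-query form shown matches the 1.x query DSL):

    // Route the search to user 2's shard and filter to that user's posts
    POST /microblagh/post/_search?routing=2
    {"query":
      {"filtered":
        {"query": {"match": {"text": "Microblagh"}},
         "filter": {"term": {"user_id": 2}}}}}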

    Regarding the choice between the two strategies, consideration should be given to specifying routing values explicitly at index time via the routing query parameter rather than relying on a field value. Supplying the value as a query parameter spares elasticsearch from parsing the document to extract the routing field, yielding a small speed increase per request. At scale this difference could be non-negligible.
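
    In the microblagh example, that would mean supplying the routing value on each indexing request rather than relying on the user_id field being extracted from the document, along these lines (here the routing value simply mirrors user_id):

    PUT /microblagh/post/5?routing=3
    {"user_id": 3, "username": "stacy", "text": "explicitly routed at index time"}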

    8.2 Clustering and Index Internals

    It’s almost impossible to truly understand elasticsearch’s clustering features without also understanding the internals of Lucene indexes. The elastic part of elasticsearch’s name refers to its clustering capabilities: its capacity to seamlessly add and remove servers and expand capacity. While Lucene does have some provisions for operations on multiple servers, they are quite basic and not nearly as scalable.

    Elasticsearch’s clustering features can be neatly divided into two categories: durability and scalability. Before we explore these topics, however, let’s dive into the internals of a Lucene index.

    8.2.1 Index Internals

    Indexes are the central piece of data to which computation is applied in elasticsearch. Everything involving data in elasticsearch occurs at the index level. While the internals of elasticsearch indexes are complex, there are a few things about them that are quite useful to know. A cursory knowledge of their implementation and architecture becomes important when considering clustering, capacity planning, and performance optimization.

    Let’s start by discussing Lucene indexes, upon which elasticsearch indexes are built. A Lucene index is subdivided into a variable number of segments at any given time. Each of these segments is a completely separate index in and of itself. A Lucene index creates more segments as documents are added and, as segments become more numerous, tries to merge them back into fewer segments. The smaller the number of segments, the faster operations run (one segment is optimal). Merging, however, has costs as well, so Lucene attempts to merge segments at a rate that balances merge costs against search efficiency.
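
    Segments can be observed and, if need be, merged by hand. The requests below use the 1.x APIs against the microblagh index from earlier; note that the _optimize endpoint was renamed _forcemerge in later versions, so the exact name depends on your version.

    // List the segments backing each shard of the index
    GET /microblagh/_segments

    // Ask Lucene to merge each shard down to a single segment
    POST /microblagh/_optimize?max_num_segments=1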

    It is due to this architecture that searches can be seamlessly executed over multiple indexes. A search over 2 indexes with 1 segment apiece is almost identical to a search over 1 index with 2 segments.

    Elasticsearch takes the Lucene index/segment symmetry one step further, leveraging Lucene’s ability to span operations over indexes to implement its clustering support. When we speak of an index in elasticsearch, we are usually talking about elasticsearch’s index abstraction, which sits atop multiple Lucene indexes. Each elasticsearch index is divided into a number of shards (5 by default). Each of these shards contains a unique portion of the documents in the elasticsearch index.
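
    The shard count is fixed when an index is created and cannot be changed afterwards, so it is worth setting deliberately. A minimal sketch, using a hypothetical index name:

    POST /asample_index
    {"settings": {"number_of_shards": 10, "number_of_replicas": 1}}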

    A shard itself is a single logical index, but is composed of a number of Lucene indexes: a primary and a configurable number of replicas, all of which contain the same documents but are full Lucene indexes in and of themselves. Multiple copies are kept to provide both durability guarantees and distributed search scalability across the cluster.

    When a document is added to an elasticsearch index, it is routed to the proper shard based on its id. When a search is executed, it is run in parallel over all the shards in the index (on either the primary or a replica Lucene index), and the results are then combined. This means that splitting your documents over one elasticsearch index with 5 shards is equivalent to manually splitting your data over 5 elasticsearch indexes with one shard apiece.

    Analysis in elasticsearch involves no special magic; it mostly relies on the same principles underlying efficient data storage and retrieval in traditional relational systems. That is to say, it is contingent on efficient traversal of sorted trees.

    In a relational database this is accomplished by creating an index for a column in a table. Indexes typically take the form of a quickly traversable copy of the column in binary-tree form. These tree structures make both lookups and other tree operations—obtaining a range based subtree for instance—very fast.

    Elasticsearch operates on the same principle; however, indexes do not need to be explicitly created, since all values are indexed into trees by default. Furthermore, analyzers can be employed to allow for efficient lookups on pre-processed versions of this data.
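
    The analyze API lets us peek at this pre-processing step directly; it returns the individual tokens that would be indexed for a piece of text. A small example:

    // Show the tokens the standard analyzer produces for a piece of text
    // (in 1.x the text to analyze is passed as the raw request body)
    GET /_analyze?analyzer=standard
    Microblagh is overrated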

    8.2.2 Index Durability

    Durability in elasticsearch is implemented by its replica feature, whereby data is mirrored to multiple servers simultaneously. By default, elasticsearch indexes have a replica count of 1. This means that each piece of data will exist on at least 2 servers in a running elasticsearch cluster: once on a primary shard and once on a replica. Upping the replica count to 4 would mean that the same piece of data exists on 5 separate servers, provided the cluster has at least that many.
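
    Unlike the shard count, the replica count can be changed on a live index through the settings API. For example, to raise the microblagh index to two replicas per shard:

    PUT /microblagh/_settings
    {"index": {"number_of_replicas": 2}}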

    Should a server fail, elasticsearch will self-heal. Given a replica count of 1 and a cluster consisting of 3 servers, each server will hold roughly 2/3 of the cluster’s data. In this scenario the loss of a single server will be tolerated without data loss. If a single server in that setup were to fail, the cluster state, visible at the cluster health endpoint /_cluster/health, would change from green to yellow. Some data on the cluster would only be present on a single server, which would cause elasticsearch to re-balance the replicas, dividing replica indexes evenly between the 2 remaining servers. Should the third server be repaired and added back in, elasticsearch would migrate data back across all 3 servers. If two of the three servers were to fail, the state would change from yellow to red.
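
    The cluster health endpoint is a plain GET; the status field of its response will read green, yellow, or red as described above.

    GET /_cluster/health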

    8.2.3 Write Durability

    Write consistency in elasticsearch may be specified per write-request, or on a per-node basis with the action.write_consistency setting. In an elasticsearch cluster one may write data at one of three consistency levels: all, quorum, and one, with decreasing guarantees for data-durability.

    The one consistency level is easy enough to understand: a single node will receive the data and persist it before acknowledging the write. After that point, the data will eventually be replicated to all replicas of the shard. The all consistency level is similarly simple: every copy of the shard, the primary and all of its replicas, acknowledges the write before a response is returned. Lastly, when using the default quorum consistency level, a majority of the copies of the shard must acknowledge the write before a response is returned. The various stages of a write, and the points at which each consistency level is achieved, are illustrated in the figure below.

    [Figure: the stages of a write and the points at which each consistency level is achieved]

    Each consistency level has its uses. The one consistency level will return most quickly, followed by the slower quorum consistency level, and finally the all consistency level, which is slowest of all. For most applications the quorum consistency level is a good trade-off between safety and performance.

    Consistency levels can be specified per write operation. For example, to set the consistency level during an update, we would issue a request similar to the one in figure 8.4.

    F 8.4 Updating a Document With An Explicit Consistency Level
    PUT /atest/atype/adocument?consistency=one
    {"afield": "avalue"}
    

    If occasional losses of data during unusual scenarios, such as cluster maintenance or the occasional outage, are acceptable, then a consistency level of one may be sufficient. If a node goes down soon after being written to, there is a chance that it had not yet replicated its most recent writes; in this case data loss may occur. The quorum consistency level is quite safe given a sufficient replica count. With a replica count of two, for instance, there are three copies of each shard in total, and data must be present on the primary and at least one of the two replicas before the request returns. When running with a low replica count, or in situations where data integrity is paramount, the all consistency setting may be appropriate. If your writes are slow, consider either lowering the consistency level or increasing the number of machines and shards in your cluster.

    If your primary data-store is not elasticsearch, it is almost certain that the all consistency level is overkill; the performance impact will likely not be worth the extra durability guarantees. As a final point, it should be noted that during a split-brain failure, where a single cluster has divided into two due to either a network partition or overloading, write-consistency guarantees are rendered moot for obvious reasons.
