Unfortunately, I no longer have the time to maintain this book, which is growing increasingly out of date (especially with the upcoming Elasticsearch 2.0). I highly recommend checking out Elasticsearch: The Definitive Guide instead. This site will remain up indefinitely to prevent link rot.

Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    2.1 Documents and Field Basics

    The smallest individual unit of data in elasticsearch is a field, which has a defined type and has one or many values of that type. A field contains a single piece of data, like the number 42 or the string "Hello, World!", or a single list of data of the same type, such as the array [5, 6, 7, 8].

    Documents are collections of fields, and comprise the base unit of storage in elasticsearch, something like a row in a traditional RDBMS. A document is considered the base unit of storage because, peculiar to Lucene, all field updates fully rewrite a given document to storage (while preserving unmodified fields). So, while from an API perspective the field is the smallest single unit, the document is the smallest unit from a storage perspective.
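
    The rewrite-on-update behaviour can be pictured with a small Python sketch. This is a toy model only, not how Lucene works internally; the function name apply_partial_update is made up for illustration. The point is that changing one field yields a complete new document, with the untouched fields carried over.

```python
import copy

def apply_partial_update(stored_doc, changes):
    """Return a brand-new document with `changes` merged over `stored_doc`.

    Toy model of 'rewrite the whole document': the old document is left
    untouched and a complete replacement is produced.
    """
    new_doc = copy.deepcopy(stored_doc)  # unmodified fields are carried over
    new_doc.update(changes)              # changed fields overwrite old values
    return new_doc

old = {"handle": "ron", "age": 28}
new = apply_partial_update(old, {"age": 29})
print(new)  # {'handle': 'ron', 'age': 29}
print(old)  # {'handle': 'ron', 'age': 28}
```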

    2.1.1 JSON, The Language of Elasticsearch

    The primary data-format elasticsearch uses is JSON. Given that, all documents must be valid JSON values. A simple document might look like the user document in figure 2.1.

    F 2.1 A Simple Document
    {
      "_id": 1,
      "handle": "ron",
      "age": 28,
      "hobbies": ["hacking", "the great outdoors"],
      "computer": {"cpu": "pentium pro", "mhz": 200}
    }
    

    Notice how this document supports all of the features of JSON. The hobbies and computer fields specifically are rich types: an array and an object (dictionary), respectively. The other fields are simple string and numeric types.

    Elasticsearch reserves some fields for special use. We’ve specified one of these fields in this example: the _id field. A document’s id is unique, and if unassigned will be created automatically. An elasticsearch id would be a primary key in RDBMS parlance.

    Although elasticsearch deals with JSON exclusively at the API level, internally the JSON is converted to flat fields for Lucene’s key/value API. Arrays in documents are mapped to Lucene multi-values.
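
    A rough Python sketch of what such flattening might look like follows. It is an illustration under assumptions, not elasticsearch's actual code: the flatten function is hypothetical, sub-object fields are assumed to be named with a dotted path, and arrays of objects are not handled.

```python
def flatten(doc, prefix=""):
    """Flatten a nested JSON-like dict into {dotted_name: [values]}."""
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            # Sub-objects contribute their fields under a dotted prefix.
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list):
            # An array becomes several values under one field name.
            flat[name] = list(value)
        else:
            # A scalar is a field with exactly one value.
            flat[name] = [value]
    return flat

doc = {
    "handle": "ron",
    "age": 28,
    "hobbies": ["hacking", "the great outdoors"],
    "computer": {"cpu": "pentium pro", "mhz": 200},
}
flat = flatten(doc)
for name, values in flat.items():
    print(name, values)
```

    In this model every field, simple or not, ends up as a name paired with a list of one or more values, which is why a single value and an array need no separate treatment.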

    2.2 Type Basics

    2.2.1 Overview

    Each document in elasticsearch must conform to a user-defined type mapping, analogous to a database schema. A type’s mapping defines both the types of its fields (say integer, string, etc.) and the way in which those fields are indexed (this gets complex, and will be covered later). Types also scope IDs within an index. Multiple documents in an index may have identical IDs as long as they are of different types.
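
    The ID scoping rule can be modelled with a dictionary keyed by the pair (type, id) rather than by id alone. This is a toy sketch; the put helper and the in-memory index are invented for illustration and say nothing about how elasticsearch stores documents.

```python
# A toy index: documents are keyed by (type name, id), not by id alone.
index = {}

def put(index, doc_type, doc_id, doc):
    """Store `doc`, replacing any document with the same type and id."""
    index[(doc_type, doc_id)] = doc

put(index, "user", 1, {"handle": "ron"})
put(index, "hacker", 1, {"handle": "jean-michel"})   # same id, different type
put(index, "user", 1, {"handle": "ron", "age": 28})  # same type and id: replaces

print(len(index))  # 2
```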

    Types are defined with the Mapping API, which associates type names to property definitions. A minimal version of the mapping for the document from figure 2.1 might look something like figure 2.2.

    F 2.2 A Sample Type Mapping
    {
      "user": {
          "properties": {
              "handle": {"type": "string"},
              "age": {"type": "integer"},
              "hobbies": {"type": "string"},
              "computer": {
                  "properties": {
                      "cpu": {"type": "string"},
                      "mhz": {"type": "integer"}}}}}}
    

    2.2.2 Data Types

    Each field in a mapping’s properties section is associated with a core type. The valid core data types in elasticsearch are shown in the table below. In addition to associating fields with data types, a type mapping defines properties such as analysis settings and default boosting values, subjects covered in later chapters.

    Type        Definition
    string      Text
    integer     32 bit integers
    long        64 bit integers
    float       IEEE float
    double      Double precision floats
    boolean     true or false
    date        UTC Date/Time (JodaTime)
    geo_point   Latitude / Longitude

    2.2.3 Arrays, Objects, and Advanced Types

    Complex JSON types are also supported by elasticsearch, using both arrays and object notation. Additionally, elasticsearch documents can handle more complex relations, such as parent/child relationships, and a special nested document type.

    Arrays are simple in elasticsearch, since any field, in any document, can hold either one value or multiple values. There is no special array type in an elasticsearch mapping. Hence, when updating a document one may set a string field to store either "foo" or ["foo", "bar", "baz"] when the document is saved. There is nothing to declare regarding a field’s array-ness in the mapping. An important thing to remember, however, is that elasticsearch arrays cannot store mixed types. If a field is declared as an integer, it can store one or many integers, but never a mix of types.
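
    The "one value or many, but never mixed" rule can be sketched as a small check in Python. The PYTHON_TYPES table and the conforms function are assumptions made for this example. The check is deliberately strict and performs no type coercion, so it may reject values that a real cluster would accept.

```python
# Assumed, simplified correspondence between a few core types and Python types.
PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": bool,
}

def conforms(value, core_type):
    """True if `value` is one value, or a list of values, of `core_type`."""
    expected = PYTHON_TYPES[core_type]
    values = value if isinstance(value, list) else [value]
    for v in values:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(v, bool) and expected is not bool:
            return False
        if not isinstance(v, expected):
            return False
    return True

print(conforms("foo", "string"))                  # True
print(conforms(["foo", "bar", "baz"], "string"))  # True
print(conforms([5, 6, "seven"], "integer"))       # False
```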

    As seen in figure 2.2, mappings may describe objects which contain other objects, as in the computer field in that figure. The key thing to remember with these objects is that their properties are declared in the properties field in the mapping, and that the type field is omitted.

    It is important to note that sub-objects are still stored in the same physical document as their parent on disk. There is a special nested type in elasticsearch which, while similar, has very different performance and query characteristics because it is stored internally as a separate document. Parent/child documents are similarly complex. Both will be discussed later in this book.

    2.3 Index Basics

    The largest single unit of data in elasticsearch is an index. Indexes are logical and physical partitions of documents within elasticsearch. Documents and document types are unique per-index. Indexes have no knowledge of data contained in other indexes. From an operational standpoint, many performance and durability related options are set only at the per-index level. From a query perspective, while elasticsearch supports cross-index searches, in practice it usually makes more organizational sense to design for searches against individual indexes.

    Elasticsearch indexes are most similar to the ‘database’ abstraction in the relational world. An elasticsearch index is a fully partitioned universe within a single running server instance. Documents and type mappings are scoped per index, making it safe to re-use names and ids across indexes. Indexes also have their own settings for cluster replication, sharding, custom text analysis, and many other concerns.

    Indexes in elasticsearch are not 1:1 mappings to Lucene indexes. They are in fact sharded across a configurable number of Lucene indexes, 5 by default, with 1 replica per shard. A single machine may have a greater or lesser number of shards for a given index than other machines in the cluster. Elasticsearch tries to keep the total data across all indexes about equal on all machines, even if that means that certain indexes may be disproportionately represented on a given machine. Each shard has a configurable number of full replicas, which are always stored on unique instances. If the cluster is not big enough to support the specified number of replicas, the cluster’s health will be reported as a degraded ‘yellow’ state. Consequently, the basic dev setup for elasticsearch always thinks that it’s operating in a degraded state, given that, with the default index settings, a single running instance has no peers to replicate its data to. Note that this has no practical effect on its operation for development purposes. It is, however, recommended that elasticsearch always run on multiple servers in production environments. Because elasticsearch is a clustered database, many of its data guarantees hinge on multiple nodes being available.
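
    The arithmetic behind the default setup can be written out in a few lines of Python. This is a simplified sketch with invented function names: it only models the rule that each copy of a shard needs its own node, and it ignores other health states and allocation details.

```python
def shard_copies(primary_shards=5, replicas=1):
    """Total number of Lucene indexes backing one elasticsearch index."""
    return primary_shards * (1 + replicas)

def health(node_count, replicas=1):
    """'green' if every copy of a shard can sit on its own node, else 'yellow'."""
    # Each shard needs 1 primary + `replicas` copies, all on distinct nodes.
    return "green" if node_count >= 1 + replicas else "yellow"

print(shard_copies())        # 10
print(health(node_count=1))  # yellow
print(health(node_count=2))  # green
```

    With the defaults, a single development node holds the 5 primary shards but has nowhere to place their 5 replicas, hence the yellow state.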

    We’ll explore some of the basic commands for working with indexes in the next section: Basic CRUD.

    2.4 Basic CRUD

    Let’s perform some basic operations on data. Elasticsearch is RESTish in design and tends to match HTTP verbs up to the Create, Read, Update, and Delete operations that are fundamental to most databases. We’ll create an index, then a type, and finally a document within that index using that type. Open up elastic-hammer, and we’ll issue the operations in figure 2.3.

    F 2.3 Simple CRUD
    // Create an index named 'planet'
    PUT /planet
    
    // Create a type called 'hacker'
    PUT /planet/hacker/_mapping
    {
      "hacker": {
        "properties": {
          "handle": {"type": "string"},
          "age": {"type": "long"}}}}
    
    // Create a document
    PUT /planet/hacker/1
    {"handle": "jean-michel", "age": 18}
    
    // Retrieve the document
    GET /planet/hacker/1
    
    // Update the document's age field
    POST /planet/hacker/1/_update
    {"doc": {"age": 19}}
    
    // Delete the document
    DELETE /planet/hacker/1
    

    In the above example, we can finally see the full CRUD (create, read, update, delete) lifecycle in elasticsearch. Note that the URL scheme is consistent for these operations, with most URLs having the form /index/type/docid, and that special operations on a given namespace are marked with an underscore prefix. Now that we can perform some basic operations on our data, it’s time for the main attraction: search.
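
    The URL scheme is regular enough to capture in a small helper. The Python function below is a hypothetical convenience written for this example, not part of any elasticsearch client; it only builds path strings and sends no requests.

```python
def doc_url(index, doc_type=None, doc_id=None, operation=None):
    """Build a path of the form /index/type/docid[/_operation]."""
    parts = [index, doc_type, doc_id]
    path = "/" + "/".join(str(p) for p in parts if p is not None)
    if operation is not None:
        # Special operations are marked with an underscore prefix.
        path += "/_" + operation
    return path

print(doc_url("planet"))                                 # /planet
print(doc_url("planet", "hacker", operation="mapping"))  # /planet/hacker/_mapping
print(doc_url("planet", "hacker", 1))                    # /planet/hacker/1
print(doc_url("planet", "hacker", 1, "update"))          # /planet/hacker/1/_update
```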
