Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    1.1 What is Elasticsearch?

    1.1.1 Brass Tacks

    Elasticsearch is a tool for querying written words. It can perform some other nifty tasks, but at its core it’s made for wading through text, returning text similar to a given query and/or statistical analyses of a corpus of text.

    More specifically, elasticsearch is a standalone database server, written in Java, that takes data in and stores it in a sophisticated format optimized for language-based searches. Working with it is convenient because its main protocol is implemented with HTTP/JSON. Elasticsearch is also easily scalable, supporting clustering and leader election out of the box.

    Whether it’s searching a database of retail products by description, finding similar text in a body of crawled web pages, or searching through posts on a blog, elasticsearch is a fantastic choice. When facing the task of cutting through the semi-structured muck that is natural language, Elasticsearch is an excellent tool.

    1.1.2 Elasticsearch is Lucene

    The core of elasticsearch’s intelligent search engine is largely another software project: Lucene. It is perhaps easiest to understand elasticsearch as a piece of infrastructure built around Lucene’s Java libraries. Everything in elasticsearch that pertains to the actual algorithms for matching text and storing optimized indexes of searchable terms is implemented by Lucene. Elasticsearch itself provides a more useable and concise API, scalability, and operational tools on top of Lucene’s search implementation.

    Lucene is old in internet years, dating back to 1999. It’s also exceedingly popular: Lucene is used by untold numbers of companies, running the gamut from huge corporations such as Twitter to small startups. It is proven, tested, and widely considered best-of-breed in open-source search software.

    Most of the mental effort users of elasticsearch devote to the task of search will be related to using the Lucene APIs elasticsearch exposes. Accordingly, this book spends more time covering Lucene via elasticsearch than anything else.

    1.1.3 The Value Add

    While Lucene is a fantastic tool, it is cumbersome to use directly, and provides few features for scaling past a single machine. Elasticsearch provides a more intuitive and simple API than the bare Lucene Java API. Crucially, it also provides an infrastructure story that makes scaling across machines and data centers relatively simple. Let’s look at some of the features elasticsearch brings to the table vs. bare Lucene:

    • A simpler API
    • Interoperation with non-Java/JVM languages
    • Operational ease of use
    • Clustering and replication
    • Good defaults for complex Lucene classes

    1.1.4 What Problems does Elasticsearch Solve Well?

    There are myriad cases in which elasticsearch is useful. Some use cases call for it more clearly than others. Listed below are some tasks for which elasticsearch is particularly well suited.

    • Searching a large number of product descriptions for the best match for a specific phrase (say “chef’s knife”) and returning the best results
    • Given the previous example, breaking down the various departments where “chef’s knife” appears (see Faceting later in this book)
    • Searching text for words that sound like “season”
    • Auto-completing a search box from partially typed words, using previously issued searches and accounting for misspellings
    • Storing a large quantity of semi-structured (JSON) data in a distributed fashion, with a specified level of redundancy across a cluster of machines

    While elasticsearch is great at solving the aforementioned problems, it’s not the best choice for others. It’s especially bad at solving problems for which relational databases are optimized, such as those listed below.

    • Calculating how many items are left in the inventory
    • Figuring out the sum of all line-items on all the invoices sent out in a given month
    • Executing two operations transactionally with rollback support
    • Creating records that are guaranteed to be unique across multiple given terms, for instance a phone number and extension

    Elasticsearch is generally fantastic at providing approximate answers from data, such as scoring the results by quality. While elasticsearch can perform exact matching and statistical calculations, its primary task of search is an inherently approximate task. Finding approximate answers is a property that separates elasticsearch from more traditional databases. That being said, traditional relational databases excel at precision and data integrity, for which elasticsearch and Lucene have few provisions.

    1.2 Up and Running

    Let’s set up the tools we’ll need to work with elasticsearch and run the tutorials in this book. Since elasticsearch is a standalone Java application, getting up and running is a cinch on almost any platform. You’ll want Java 1.7 or newer. You can check your version by running java -version at a command prompt. If you don’t have Java installed on your system, download the JDK now.

    1.2.1 Installing Elasticsearch

    Elasticsearch is a standalone Java app, and can be easily started from the command line. A copy can be obtained from the elasticsearch download page.

    On Linux

    If you’re running a Linux distribution, I recommend installing either the .deb or .rpm version of the software. Once installed, you can run /etc/init.d/elasticsearch start to start the server as a daemon in the background.

    Mac OSX and Generic UNIX/Linux

    Download the .zip version and unzip it to a folder. Open a terminal, navigate to the bin directory, and run the elasticsearch executable within by typing ./elasticsearch at the terminal.

    Microsoft Windows

    Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double click elasticsearch.bat to run.

    Check if your Server is Running

    After you’ve started your server, you can ensure it’s running properly by opening your browser to the URL http://localhost:9200. You should see a page that looks something like Figure 1.1.

    Figure 1.1 A Sample Response
    {
      "ok": true,
      "status": 200,
      "name": "Psyche",
      "version": {
        "number": "0.90.0",
        "snapshot_build": false
      },
      "tagline": "You Know, for Search"
    }
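
The same check can be scripted instead of done in a browser. The following is a minimal Python sketch using only the standard library. The function names summarize_root_response and check_server are hypothetical helpers introduced for illustration, and the field names are taken from the sample response in Figure 1.1, so other elasticsearch versions may return a different shape.

```python
import json
import urllib.request


def summarize_root_response(body):
    """Extract the node name and version number from the root endpoint's JSON."""
    info = json.loads(body)
    # "name" and "version.number" appear in the sample response (Figure 1.1);
    # .get() keeps this tolerant of servers that omit either field.
    return info.get("name"), info.get("version", {}).get("number")


def check_server(url="http://localhost:9200"):
    """Fetch the root endpoint and return (name, version).

    Raises urllib.error.URLError if the server cannot be reached.
    The default URL is an assumption: a local server on port 9200.
    """
    with urllib.request.urlopen(url, timeout=5) as response:
        return summarize_root_response(response.read().decode("utf-8"))
```

With a server running locally, calling check_server() should return the node name and version number; if the server is not up, the call raises an error rather than returning.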
    

    1.2.2 Loading Datasets

    Some of the code samples depend on pre-built datasets being loaded into your elasticsearch server. To run those examples you’ll need a copy of the ee-datasets repository available at https://github.com/andrewvc/ee-datasets. This repository changes often as the book is still in the process of being written, so be sure to check for updates regularly via git pull. If you’d like to obtain the datasets without using git, download the zip archive at https://github.com/andrewvc/ee-datasets/archive/master.zip.

    After you’ve cloned the repository you’ll be able to load examples into your server by executing the included elastic-loader.jar program, providing the address of your elasticsearch server, and the path to the data-file. To load the movie_db dataset, for example, open a command prompt in the ee-datasets folder, and run:

    java -jar elastic-loader.jar http://localhost:9200 datasets/movie_db.eloader
    

    1.2.3 Running Examples

    This book makes exclusive use of elasticsearch’s most popular API, its JSON HTTP API. All examples in this book are implementation-independent descriptions of HTTP requests, so feel free to use any client you wish. That said, the free Elastic Hammer tool is the recommended way to query elasticsearch. I developed this tool specifically to aid in the teaching of this book, and it’s fairly simple to use. Other tools, such as cURL, should work fine as well.

    HTTP operations in this book are described with a simple syntax consisting of the HTTP verb (GET, PUT, POST, HEAD, DELETE), URL path, and request body. A simple example request can be seen in figure 1.2.

    Figure 1.2 A Simple Request
    GET /_status
    

    Some operations require including a body in the request. When including a body, the body contents will appear directly below the method and path as in figure 1.3.

    Figure 1.3 Request with a Body
    POST /_analyze?analyzer=snowball
    Rollerblades
    

    If executed, the previous example would issue a POST request to /_analyze?analyzer=snowball with the HTTP body set to "Rollerblades".
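
For readers who prefer scripting requests, the notation maps directly onto an HTTP call. Below is a minimal Python sketch using only the standard library. The helper name build_request and the BASE_URL value are assumptions made for illustration, not part of elasticsearch or of this book's tooling.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local server on the default port


def build_request(verb, path, body=None, base_url=BASE_URL):
    """Turn a verb, URL path, and optional body into a urllib Request object."""
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(base_url + path, data=data, method=verb)


# The request with a body shown above, expressed as an object (built, not yet sent).
request = build_request("POST", "/_analyze?analyzer=snowball", "Rollerblades")

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the verb, URL, and body before anything touches the server.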

    Examples requiring the use of a specific data-set will have that dataset’s name provided as a comment on the first line of the example. For instance:

    Figure 1.4 Request with a Dataset
    // Load Dataset: movie_db.eloader
    POST /movie_db/_search?pretty=true
    {"query": {"match": {"_all": "story"}}}
    

    Before running this example, you would need to ensure the dataset movie_db.eloader was loaded into your local elasticsearch server. Please read section 1.2.2 for further information on seeding your database with .eloader files.
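
If you want to process examples in this notation programmatically, the pieces are easy to pull apart. The sketch below is one possible reading of the format described in this section; parse_example is a hypothetical helper, and it assumes the dataset comment always begins with the exact text "// Load Dataset:".

```python
def parse_example(text):
    """Split an example into (dataset, verb, path, body).

    dataset and body are None when the example does not include them.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    dataset = None
    prefix = "// Load Dataset:"
    # An optional first-line comment names the dataset the example depends on.
    if lines[0].startswith(prefix):
        dataset = lines.pop(0)[len(prefix):].strip()
    # The next line holds the HTTP verb and the URL path.
    verb, path = lines[0].split(None, 1)
    # Anything after that line is the request body.
    body = "\n".join(lines[1:]) or None
    return dataset, verb, path, body


example = """\
// Load Dataset: movie_db.eloader
POST /movie_db/_search?pretty=true
{"query": {"match": {"_all": "story"}}}
"""
dataset, verb, path, body = parse_example(example)
```

Here dataset holds the name of the file to load first, while verb, path, and body describe the request itself.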
