
Exploring Elasticsearch

A human-friendly tutorial for elasticsearch.

    Chapter 6 Searching the Wikipedia Dataset

    This chapter will cover more advanced searches using Wikipedia’s English articles as a data set. These articles comprise some 4.3 million documents. Downloading, parsing, and indexing this many documents takes quite a bit of time and computational power. The dataset takes up 9.5 GiB as a compressed bzip2 file, and more than 45 GB of space when indexed in elasticsearch. Processing the dataset is time-intensive as well; indexing it took many hours on a beefy m3.xlarge EC2 instance.

    To ease this task I’ve prepared a virtual machine image containing a running elasticsearch instance with the full Wikipedia data set loaded. If you already have an Amazon Web Services (AWS) account, simply visit http://bit.ly/18YPYCe to launch the AMI. I recommend booting an m1.medium instance, with 1 core and 3 GB of RAM, or a c1.medium (High CPU Medium) instance. You may use a cheaper instance, but be sure to reconfigure the ES_HEAP_SIZE variable in /etc/init.d/elasticsearch. Be sure to change your EC2 instance’s security group to allow TCP port 9200 for elasticsearch access. When using the EC2 image, check the state of the cluster at _cluster/health. It may come up initially with a red status; if that happens, simply wait 5-10 minutes for the cluster to scan the indices and recover to normal function. If you’d rather load the data onto your own hardware, follow the instructions in the README of the andrewvc/wikiparse GitHub repository.
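
    To check the cluster state, you can issue a simple request against the health endpoint; the status field in the response should move from red to yellow or green once recovery completes.

    GET /_cluster/health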

    6.1 Working With the Wikipedia Schema

    Before we dive into building a search engine I’d like to discuss the process around creating the dataset. It’s of particular interest to those importing large amounts of data. In this section I’ll cover the workflow of Wikiparse, the program I wrote to import the Wikipedia data.

    Wikiparse takes the BZ2-compressed XML dump from Wikipedia and constructs a pipeline that sends that data to an elasticsearch instance. Each stage of the pipeline feeds into the next and executes ‘lazily’, doing only partial work, say decompressing a small segment of the BZ2 data or deserializing a section of the XML, before letting the downstream stage do some work. In this way memory requirements stay relatively low, since the entire uncompressed XML representation never has to be held in memory at once. The pipeline has the following stages; a rough sketch in code follows the list.

    1. Decompress the BZ2 file
    2. Parse the XML as a stream
    3. Iterate through Wikipedia <page> elements in the XML
    4. Filter out elements that are not full articles or redirects
    5. Format each article as an update-with-upsert request to elasticsearch
    6. Accumulate batches of these updates, then send using the bulk API
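
    To make the pipeline shape concrete, here is a rough Python sketch of the same lazy flow. This is not the actual Wikiparse code; the namespace URI, batch size, and the abbreviated bulk bodies are assumptions, with the real update bodies being those shown later in figure 6.2.

    import bz2
    import json
    import xml.etree.ElementTree as ET

    # The namespace URI varies with the dump's export schema version.
    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    def pages(path):
        """Lazily yield (title, redirect_target, body) tuples from a bz2-compressed dump."""
        with bz2.BZ2File(path) as raw:
            for _, elem in ET.iterparse(raw):
                if elem.tag != NS + "page":
                    continue
                # Keep only main-namespace pages: full articles and redirects.
                if elem.findtext(NS + "ns") == "0":
                    redirect = elem.find(NS + "redirect")
                    yield (elem.findtext(NS + "title"),
                           redirect.get("title") if redirect is not None else None,
                           elem.findtext(NS + "revision/" + NS + "text"))
                elem.clear()  # release the parsed element so memory stays flat

    def bulk_lines(page_iter):
        """Format each page as a bulk-API update action. The request bodies are
        abbreviated here; the real script/params/upsert are those in figure 6.2."""
        for title, target, body in page_iter:
            is_redirect = target is not None
            doc_id = target if is_redirect else title
            yield json.dumps({"update": {"_index": "en-wikipedia",
                                         "_type": "page", "_id": doc_id}})
            yield json.dumps({"params": {"redirect": title, "title": doc_id,
                                         "is_redirect": is_redirect},
                              "upsert": {"title": doc_id,
                                         "redirects": [title] if is_redirect else []}})

    def batches(lines, size=1000):
        """Accumulate newline-delimited bulk lines into POST-sized payloads."""
        batch = []
        for line in lines:
            batch.append(line)
            if len(batch) >= size:
                yield "\n".join(batch) + "\n"
                batch = []
        if batch:
            yield "\n".join(batch) + "\n"

    # Each payload would then be POSTed to http://localhost:9200/_bulk.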

    The update-with-upsert stage is perhaps the most interesting part of the process. A standard create is not suitable for the Wikipedia data because the data we want is broken up across the file, and elasticsearch documents must be built up incrementally from multiple records in the source dataset. Wikipedia articles each have a number of redirects: stub pages that, if you visit them, simply redirect you to a full article. For example, there is a “WWII” redirect that points you to “World War II.” These redirects are not nested children of the article they point to; rather, they are represented with the same <page> elements as the full articles in the data dump XML. They are, however, distinguishable via a special <redirect> tag. Representing this data in elasticsearch using separate documents for redirects is problematic for a number of the searches we’ll implement. We really want all the data about an article, including its redirects, in one document. The names of redirects are valuable search-related information, and we want that information nested directly within our elasticsearch index. To that end, the schema for our page documents is shown in figure 6.1.

    F 6.1 Abridged Wikipedia Document Mapping
    GET /en-wikipedia/page/_mapping
    {
      "page": {
        "properties": {
          "body": {
            "type": "multi_field",
            "fields": {
              "body_simple": {
                "type": "string", "analyzer": "simple"},
              "body_snow": {
                "type": "string", "analyzer": "snowball"}}},
          "redirects": {
            "type": "multi_field",
            "fields": {
              "title_exact": {
                "type": "string", "index": "not_analyzed"},
              "title_simple": {
                "type": "string", "analyzer": "simple"},
              "title_snow": {
                "type": "string", "analyzer": "snowball"}}},
          "suggest": {
            "type": "completion",
            "index_analyzer": "simple",
            "search_analyzer": "simple",
          },
          "title": {
            "type": "multi_field",
            "fields": {
              "title_exact": {
                "type": "string", "index": "not_analyzed"},
              "title_simple": {
                "type": "string", "analyzer": "simple"},
              "title_snow": {
                "type": "string", "analyzer": "snowball"}}}}}}
    

    Observe that the redirects field is represented as a string field; we can store an array of redirect titles in it. Here is where some fancy juggling comes into play. With a 9.5 GiB source dump of Wikipedia data, we can’t gather all the data related to a single article and then create its document with a single insert, because a bzipped XML file has no random access and no index grouping articles with their redirects. We could, of course, load the dump into an SQL database or some other system that groups the data in a first pass before loading it into elasticsearch, but that would be wasteful; we already have a perfectly good database in elasticsearch. Instead, we use the update statement in figure 6.2 to incrementally build the documents as we scan through the dataset.

    F 6.2 Incrementally Updating Wikipedia Pages via Update
    POST /en-wikipedia/page/DOCID/_update
    {"script": "if (is_redirect) {ctx._source.redirects += redirect};
                ctx._source.suggest.input += title;
                if (!is_redirect) { ctx._source.title = title; ctx._source.body = body};",
     "params": {"redirect": "Page Title", 
                "title": "Redirect Target or Page Title",
                "is_redirect": true or false},
     "upsert": {"title": "Redirect Target or Page Title",
                "redirects": [redirect] or [],
                "suggest": {"input": ["Page Title"],
                            "output": "Redirect Target or Page Title"}
                "body": "Body text if not redirect"}}
    

    This update statement is complicated, to be sure, but breaking it apart yields understandable behavior. Its essential goal is to take whatever redirect or full-article data we have and put the right data in the right places. The easiest case to understand is the one where no document currently exists. The upsert field handles this case, inserting its value as the document and ignoring the script and params fields when no existing document with the given DOCID is present. When upserting a redirect we generate a partial document, filling in the title (which we know, since that is what the redirect points to) and a single redirect, but leaving the body blank. When upserting a full article we fill in everything and use an empty array for the redirects field. In the case where a document already exists the upsert field is ignored and the script is executed, which will either append a redirect if we are processing a redirect, or fill in the title and body if we are processing the full article.
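
    As a concrete illustration, processing the “WWII” redirect mentioned earlier might issue an update along these lines (DOCID stands in for the document ID just as in figure 6.2, and the body is omitted because a redirect carries none; treat the values as illustrative):

    POST /en-wikipedia/page/DOCID/_update
    {"script": "if (is_redirect) {ctx._source.redirects += redirect};
                ctx._source.suggest.input += title;
                if (!is_redirect) { ctx._source.title = title; ctx._source.body = body};",
     "params": {"redirect": "WWII",
                "title": "World War II",
                "is_redirect": true},
     "upsert": {"title": "World War II",
                "redirects": ["WWII"],
                "suggest": {"input": ["WWII"],
                            "output": "World War II"}}}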

    From a performance standpoint, the process as described has one major issue. If a redirect is indexed after the full article, the full article text must be re-indexed; remember, there are no truly partial updates in elasticsearch. Any update to a document means deleting the previous version and adding a new one. When performing small numbers of updates this effect is not noticeable, but when bulk loading a dataset the size of Wikipedia it is significant. To this end, the wikiparse code scans the bzipped dataset twice: the first pass loads the redirects, creating stub articles each with a full redirect list; the second pass loads the full article text.

    6.1.1 A Plan of Attack

    Designing a search algorithm without an idea of the sort of queries it will receive is not a smart way to work. We’ll start our process by writing a simple manual test plan before we write any code. This will take the form of a simple grid of query text and expected results. While we don’t have access to real-world usage data for Wikipedia’s search feature, we can come up with some reasonable guesses for representative queries. I’ve put these in the table below.

    Query Text                      Expected Results
    Alexander Graham Bell           Exact article/title match
    Alexander Bell                  Partial title match
    Step Dances                     Should stem “Dances” and match “Step Dance”
    International Incident          Articles mentioning international incidents
    World’s Most Populous Cities    Matches “World’s Largest Municipalities by Population” article in top results
    British Trade                   Matches “East India Company” and various other articles on the trade history of England

    This set of test data has been carefully chosen based on the different strategies required for each search term. The “Alexander Graham Bell” query is likely the simplest, requiring only a search of article titles (it doesn’t even need to examine the body of the article) to find the best match. “Step Dances” is also simple; we anticipate that it will find the “Step Dance” article thanks to the snowball stemmer applied to the word “Dances”. Searching “World’s Most Populous Cities” will be tricky because it is neither an exact title match nor an exact body match; rather, it’s a concept we expect to find expressed somewhere within an article. It is, for example, somewhat present in the “World’s Largest Municipalities by Population” article. Lastly, the “British Trade” search is tricky because, once again, this is not an exact match, and we are searching for a group of articles with a common theme.
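
    To preview the kind of query the “Step Dances” case calls for, a match query against the snowball-analyzed title field might look something like the sketch below. Later sections refine this considerably, and depending on how the multi_field is configured the sub-field may need to be addressed as title_snow rather than title.title_snow.

    POST /en-wikipedia/page/_search
    {"query": {"match": {"title.title_snow": "Step Dances"}}}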

    6.1.2 Understanding our Schema

    Before we dive in and start writing queries we must first understand our schema. The data is indexed under en-wikipedia/page, that is, under the en-wikipedia index with the page type. Let’s examine its mapping, shown in figure 6.1.

    The three most important fields in our mapping are the title, body, and redirects fields. The first two correspond to the article title and the main text of the article. The redirects field holds an array of article titles that redirect to the article identified by title.
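
    You can see all three fields on any indexed article by fetching its document directly; the _source in the response contains the title, the body, and the accumulated redirects array (DOCID again stands in for a real document ID):

    GET /en-wikipedia/page/DOCID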
