Interview with the Github Elasticsearch Team
Transcript condensed and edited from the original audio. Thanks to both Tim Pease and Grant Rodgers of Github for taking the time to answer these questions!
Andrew Cholakian: I’m here with the elasticsearch guys at Github, Tim Pease and Grant Rodgers. For those who don’t know, Github has a search tool that’s based on elasticsearch. We’re going to talk about how they operate elasticsearch at scale. Can you walk us through what the actual applications of search are at Github? Is it just the consumer-facing part?
Tim Pease: There is a lot of search at Github. There is definitely the consumer-facing part. If you go to github.com/search you can look through repositories, users, issues, pull requests, and source code. The goal is to make everything that is publicly available easy to find.
Just those five things would correspond to five separate search indexes. Behind the scenes, we actually have probably a good 40 to 50 search indexes, just for all the different things that we keep track of. One example would be audit logs, so any time a user performs some sort of security-related action on Github, we write a document to an audit log in elasticsearch. If somebody thinks their account has been compromised or somebody has used it nefariously, we have a record of various actions associated with that account.
We also track all exceptions from the various components that power Github.com. All of those exceptions end up in an elasticsearch index, which means we can work with some fun little histogram facets and various other statistical facets. We can, for instance, track increases in the rate of a particular type of exception, and that will reveal bugs in our software.
Andrew Cholakian: So you guys have your own error reporting framework built in elasticsearch. Did you say that you use the time histogram facets to see where the spikes are in that graph?
Tim Pease: Correct. That is something that Grant could definitely talk about in much more depth if you’re interested.
Andrew Cholakian: I actually am interested. There’s a ton of different databases out there, so what made elasticsearch stand out for that task?
Grant Rodgers: Originally that service was using only Riak and the idea was to use Riak secondary indexes. They were performing analytic calculations and storing them in Riak. For whatever reason that didn’t work very well. It was decided a few months ago to move all of the calculations to elasticsearch using the histogram facets, and that’s worked really well.
We’re just really happy with the way elasticsearch performed. We’re trying to expand its use in that particular application.
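[Illustration, not from the interview: a minimal sketch of the kind of search body a date histogram facet takes in the facets API of that era. The field names `class` and `created_at`, the facet name `over_time`, and the example exception class are assumptions made up for this sketch, not Github's actual mapping.]

```python
import json


def exception_rate_facet(exception_class, interval="hour"):
    """Build a search body that buckets matching exceptions by time."""
    return {
        # "class" is a hypothetical field holding the exception type.
        "query": {"term": {"class": exception_class}},
        "size": 0,  # only the facet counts are wanted, not the hits
        "facets": {
            "over_time": {
                # "created_at" is a hypothetical timestamp field.
                "date_histogram": {"field": "created_at", "interval": interval}
            }
        },
    }


body = exception_rate_facet("Timeout::Error")
print(json.dumps(body, indent=2))
```

[A spike in one bucket relative to its neighbours is the kind of signal described above.]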
Andrew Cholakian: Do you guys have multiple clusters for these different applications or do you have one big co-tenant cluster?
Tim Pease: We definitely have different clusters. The exception tracking lives in a totally separate cluster. That way if we get a large volume of exceptions for whatever reason, it’s not going to take down other parts of the site. We try to segregate production clusters from things that are either just research or supporting internal tools. I think we currently have five clusters that I’m aware of across a couple of different data centers.
Andrew Cholakian: Can you give us an overview of what a Github elasticsearch cluster looks like? What goes into building one?
Tim Pease: Oh my goodness.
Andrew Cholakian: If it’s huge, maybe a minimal version of how that whole thing works.
Tim Pease: I think addressing it from just how we progressed in time would give you some good insight. We started using elasticsearch roughly two years ago. Originally all search was done inside of Solr. As more people started using Github and putting their repositories on there, we quickly exceeded the volume, literally the storage space, that one Solr instance could handle. The choice then was: do we figure out how to shard our own data and cluster Solr that way, or do we move to something else that will handle that for us?
We decided to move to elasticsearch, because we figured Shay Banon could shard things much better than we could. We started migrating things off of Solr on to this small four machine cluster inside of Rackspace cloud. Everything lived inside of that cluster. All these different things that we’ve talked about, the audit logs, all of our production data, all of our internal tools, were being supported on this one cluster. That worked out very well.
From there, we decided to start working on searching source code, which is an entirely different scale of data. We quickly realized that a few machines inside a Rackspace cloud could not handle this volume of data. We began spinning up a new cluster inside of Amazon’s EC2. That definitely is the largest cluster [we have] by sheer volume of documents stored and machines involved. I believe that’s 44 separate EC2 instances at the moment. Each machine in there has two terabytes of ephemeral SSD storage attached to it.
That one is running elasticsearch 0.20. The volume of data there is 30 terabytes of primary data.
Andrew Cholakian: Could you walk us through the schema you guys use? What’s a document in there look like?
Tim Pease: We have two document types in there: One is a source code file and the other one is a repository. The way that git works is you have commits and you have a branch for each commit. Repository documents keep track of the most recent commit for that particular repository that has been indexed. When a user pushes a new commit up to Github, we then pull that repository document from elasticsearch. We then see the most recently indexed commit and then we get a list of all the files that had been modified, or added, or deleted between this recent push and what we have previously indexed. Then we can go ahead and just update those documents which have been changed. We don’t have to re-index the entire source code tree every time someone pushes.
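[Illustration, not from the interview: a minimal sketch of the incremental update described above. It assumes each commit can be summarized as a `{path: blob_sha}` snapshot; the function name and the sample paths are hypothetical, and a real implementation would get this list from git rather than from dictionaries.]

```python
def plan_index_updates(indexed_tree, pushed_tree):
    """Compare two {path: blob_sha} snapshots.

    Returns (to_index, to_delete): paths that are new or changed in the
    push, and paths that no longer exist and must leave the index.
    """
    to_index = sorted(
        path for path, sha in pushed_tree.items()
        if indexed_tree.get(path) != sha
    )
    to_delete = sorted(path for path in indexed_tree if path not in pushed_tree)
    return to_index, to_delete


# Snapshot at the last indexed commit versus the newly pushed commit.
indexed = {"README.md": "a1", "lib/app.rb": "b2", "old.rb": "c3"}
pushed = {"README.md": "a1", "lib/app.rb": "b9", "lib/new.rb": "d4"}

to_index, to_delete = plan_index_updates(indexed, pushed)
print(to_index)   # ['lib/app.rb', 'lib/new.rb']
print(to_delete)  # ['old.rb']
```

[Because the comparison always starts from the last commit recorded as indexed, skipped pushes in between are picked up by the next one.]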
Andrew Cholakian: So, you guys only index, I’m assuming, the master branch.
Tim Pease: Correct. It’s only the head of the master branch that you’re going to get in there and still that’s a lot of data, two billion documents, 30 terabytes.
Andrew Cholakian: That is awesomely huge.
Tim Pease: It is. It’s really scary if I think about it too much. Those are the repository documents. They’re very tiny. We just do GETs on those based on the repository IDs from our database. It’s very quick to access. The source code documents contain things like the path of the file, the actual file contents, the extension of the file, how big it is, whether it’s public or private, what the language of the file is. We also store the git blob sha in there and the last commit sha that we had that actually touched that particular file. How much more detail do you want?
Andrew Cholakian: Just to go a little farther, is the actual source itself in a single field in the document?
Tim Pease: That is correct.
Andrew Cholakian: Can you walk us through the process of designing the queries powering your consumer site search? What was that process like and how did you guys approach that problem?
Tim Pease: I think when I was hired on, I was given the task of, “Hey Tim, here’s all the Solr data. Let’s turn it into elasticsearch stuff.” I just dug in and started with the issues and all the comments associated with the issues on Github. I thought, “Oh that would be pretty easy and straightforward to do.” I started with the active record models that comprise the issues and all the comments there and looked and saw what data we had available. I then sat down and looked at the current queries that we had in Solr and basically tried, as a starting point, just to replicate the types of searches there. Then I talked with various people and looked at what our customers were wanting out of search. We started adding features, different ways to filter the data, different ways to figure out, “do we want to be able to search?” We also wanted to allow people to search their private things as well.
Andrew Cholakian: I assume you had to stay away from a large number of the less performant queries, and it’s probably just the simpler match queries that you guys use?
Tim Pease: When you type in a search into any of the search fields on Github, it’s using the Lucene query parser. So, everyone out there, don’t just start trying to do all kinds of crazy things. We actually go through and escape nearly all of the special characters. If you try to do a phrase query and throw a phrase slop on there, we’re actually going to escape out the slop operator, because those are things that would destroy performance of the cluster. What we wanted out of the Lucene query parser was a quick way just to be able to do phrases, so you can put double quotes around things and match entire phrases. We also wanted the and/or support. As we’re constructing queries, every word you put in there we’re by default using the and operator, so you can then put or in there. You can also put not in front of words and things like that. That’s I think a good starting point. It’s on our dream wish list to go back through, pull out the Lucene query parser, and write our own parser at the Ruby layer to support and, or, phrases, not, and things like that.
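[Illustration, not from the interview: a minimal sketch of escaping Lucene query syntax while leaving double quotes alone so phrases still work. The exact set of characters Github escapes is not stated, so the character list here is an assumption based on the standard Lucene special characters, and `escape_query` is a hypothetical name.]

```python
import re

# Lucene query syntax characters to neutralize. The double quote is
# deliberately absent so that "quoted phrases" keep working, and the
# words AND / OR / NOT are untouched because they are not in the class.
_SPECIAL = re.compile(r'[+\-!(){}\[\]^~*?:\\/&|]')


def escape_query(text):
    """Prefix every special character with a backslash."""
    return _SPECIAL.sub(r'\\\g<0>', text)


print(escape_query('"user login"~5 pass*'))
# "user login"\~5 pass\*
```

[With the slop and wildcard operators escaped, the parser treats them as literal text instead of expanding them into expensive queries.]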
Andrew Cholakian: Going back to the cluster, you had I guess 44 EC2 boxes with how many terabytes again?
Tim Pease: Eight of them are not storage nodes. Eight of them just handle the queries.
Andrew Cholakian: Oh okay, so those were the master only, no data, nodes.
Tim Pease: Yup, and those are the ones that we send queries to, where they fan out to the various storage nodes. We have 36 storage nodes with two terabytes each, so about 72 terabytes.
Andrew Cholakian: How did you guys, before you launched this product, load test this thing to make sure it would actually work when you put up the Github blog posting announcing it?
Tim Pease: This is embarrassing, but when we first got everything configured and set up, all the staff users of Github hit it before it was live. When we sent out a blog post and said, “Hey, we have this new code search and people can start using it,” the load went up on the machines. We were just keeping an eye on it. A few days later, someone said, “Oh my goodness. You can search for people’s passwords on Github,” and “Here are some fun queries that you can use for people’s source code.” These people would check in either their private SSH keys, or various other things like that. The load spiked a lot there. Then the whole cluster basically became very unstable and crashed.
Andrew Cholakian: When you say crash, what specifically was the failure mode?
Tim Pease: Some of the machines were dropping out of the cluster and so the shards that those machines had, elasticsearch would then try to replicate them off to other places. It was a very weird failure mode where the machines would be coming in and dropping out of the cluster. A few of the machines would go off and form their own cluster. We got into a split-brain situation there. At that point, we disabled queries. We put up a notice saying “Sorry, code search is offline.” We reached out to the elasticsearch guys themselves. They actually spent a good 48 hours with us in a Campfire chat room walking us through the stuff to get this cluster back online. Some of the shards were corrupt and not recoverable. Again, this was with the 0.20 version. We had also chosen a poor version of the JVM to run this on. I believe we were using OpenJDK at that point in time.
Andrew Cholakian: They recommended switching off of OpenJDK to the Java 7 JVM?
Tim Pease: Yup. We were on OpenJDK, which was Java 6 at that point in time. They said that there were definite known issues for that version and they recommended going over to the Sun Java runtime and the latest JDK there. Believe it or not, that really fixed a lot of things. The core reason there, if you want to get into details, is that elasticsearch is using a lot of the NIO libraries for all the async communication and whatnot. Those were not fully baked in OpenJDK 6.
After we upgraded our Java, Shay found one or two bugs in some of the shard recovery procedures and they got us a quick patch version of elasticsearch. We actually got elasticsearch back up and running. It’s been very stable since then. We lost some of the search data, but then it was just a matter of going back through and just re-indexing the repositories that had been lost.
Andrew Cholakian: Interesting. Was it mostly the JDK issue or were there any other issues with the configuration and other operational things?
Tim Pease: Definitely with configuration. We adjusted some of the recovery settings to help prevent split-brain. A setting discovery.zen.minimum_master_nodes is very important there. If you don’t do that, things will go south.
Andrew Cholakian: For those who aren’t aware of that aspect of elasticsearch, and correct me if I’m wrong here Tim, it pretty much says that you have an expected size of your cluster. If you have a hundred machines, you don’t want a bunch of nodes to form a cluster if there’s only four of them.
Tim Pease: Correct.
Andrew Cholakian: You know ahead of time that that’s bad. Now, it could be the case from elasticsearch’s perspective, that you suddenly decided you only need the four nodes, that you didn’t really care about the other 96, but if you make that commitment in your config file, then these four will just assume that “Okay we are on our own kind of rogue mission and we should abort.”
Tim Pease: Yup. It’s very interesting, but the bad thing is when you get into that split-brain mode the few machines that are in that smaller cluster will basically decide, “Oh we don’t have all the shards and so we’ll just create a bunch of empty shard files,” and it just gets really... yeah. It gets ugly.
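[Illustration, not from the interview: a minimal sketch of the majority rule commonly recommended for `discovery.zen.minimum_master_nodes`, using the hundred-node example from the conversation. It assumes the count is taken over master-eligible nodes; the function names are hypothetical, and the value Github actually configured is not stated.]

```python
def minimum_master_nodes(master_eligible):
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def can_elect_master(visible, master_eligible):
    """A group of nodes may elect a master only if it sees a majority."""
    return visible >= minimum_master_nodes(master_eligible)


print(minimum_master_nodes(100))  # 51
print(can_elect_master(4, 100))   # False: the four rogue nodes stand down
print(can_elect_master(96, 100))  # True: the larger side keeps running
```

[Since two disjoint groups cannot both hold a majority, at most one side of a network partition can elect a master.]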
Andrew Cholakian: So, what caused it all is essentially just too much load. There was a ton of CPU usage, and probably network traffic, and essentially communications stuff in the environment protocol had that issue.
Tim Pease: I believe it was an IO layer there where the backend transport on port 9300 was having difficulty with all these machines communicating with each other.
Andrew Cholakian: That’s a really fascinating story. I think most people don’t really have the chance to experience that. I’m glad you got to experience that for us.
Tim Pease: Wistful memories to remember while sitting at home by a fire with a glass of Scotch, “Listen to these tales young’uns.”
I will say, we’re trying to move away from EC2 infrastructure. Our intention, when we threw everything up in EC2, was to eventually move out of there. We have acquired physical hardware that we’re spinning up in a data center on the east coast close to where the Amazon data centers are. That cluster is actually just eight machines, so we’re replacing 44 EC2 instances with the eight pieces of physical hardware.
Andrew Cholakian: Are you going to keep the same scale, or cluster, or are they going to be much, much larger machines? Is that the basic idea?
Tim Pease: These are definitely much larger machines. Each one has 32 CPU cores. I believe they have 14 terabytes of SSD attached to a hardware RAID. There we are using the rack-aware features in elasticsearch 0.90. We have two racks with four machines in each. The replicas are distributed across the racks. One entire rack could go down and we would still have an operational search cluster.
Andrew Cholakian: For that kind of 32-way multicore machine, is it recommended to run a single JVM instance per machine and just let the threading handle it, or to maybe virtualize and have a few virtual machine servers on each one?
Tim Pease: We are doing it single JVM and just letting the threading handle everything. They’re very nice machines.
Tim Pease: I’m sure this is going to come up, but how do you migrate 30 terabytes of data?
Andrew Cholakian: Yeah, very carefully, is that the answer?
Tim Pease: They’re distinct clusters. Since we are not using elasticsearch as our source of record, it’s not our canonical data store, we are just reading all the file objects from the git repositories again and re-indexing everything into the new cluster. There’s no migration or anything like that. When people push new code up, we are actually doing writes to both clusters at the same time.
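[Illustration, not from the interview: a minimal sketch of writing each document to two clusters during a changeover. The cluster names and the use of plain dictionaries as stand-ins for clusters are assumptions for the sake of a runnable example; how Github's indexing jobs handle a failed write is not described.]

```python
def index_to_all(clusters, doc_id, document):
    """Send the same document to every cluster; return the names that failed."""
    failed = []
    for name, index_fn in clusters.items():
        try:
            index_fn(doc_id, document)
        except Exception:
            # Assumption: a real job would be retried from the queue.
            failed.append(name)
    return failed


# Dictionaries stand in for the old and the new cluster.
old_cluster, new_cluster = {}, {}
clusters = {
    "ec2": old_cluster.__setitem__,
    "hardware": new_cluster.__setitem__,
}

failed = index_to_all(clusters, "repo1:README.md", {"path": "README.md"})
print(failed)                      # []
print(old_cluster == new_cluster)  # True
```

[Since the git repositories remain the source of record, a cluster that misses a write can be brought back in line by re-indexing.]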
Andrew Cholakian: When you architect a cluster the size of Github’s, one of the big concerns is going to be shard granularity, potentially even index granularity, because a shard and an index are actually very similar things in elasticsearch. How did you guys approach that problem?
Tim Pease: In EC2, we have 500 shards and that is too high I would say. Each shard is about 50 to 80 gigs in size depending on which ones you’re looking at. If you could do an optimize operation at the shard level, it wouldn’t be a problem. Unfortunately optimize only happens at the index level, and so many of these management things happen there.
Andrew Cholakian: Can you walk us through the optimize operation and why it’s important for you guys specifically?
Tim Pease: We had not done an optimize on this code search cluster. For those who don’t know, in Lucene, when you write out the inverted index, which is just keeping track of which words are in which documents, if you delete a document, you have to go through and update quite a few data structures on disk. If you were to just rewrite those data structures every time, it would certainly destroy performance because you would just be thrashing the disk all the time trying to delete documents and add documents. Lucene is very smart in what it does. All it does is set a bit in the data structure on disk saying whether this document is present or deleted. It’s a tombstone. After a while, inside these files on disk, 10% of all the documents might be marked as deleted. What an optimize will do is read every single one of the files and expunge the deletes from each file. The reason you want to do this is it will greatly improve query performance. If you get into a very dramatic case where 50% of your documents are deleted, you have to step through each document, looking at these term vectors, [checking these tombstones]. I’m definitely glossing over things at a very high level. That’s a whole course on information retrieval that you could take.
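[Illustration, not from the interview: a toy model of tombstones and optimize. It is not how Lucene stores data; the `Segment` class and `optimize` function are hypothetical and only show the idea that a delete flips a bit, searches must skip flagged documents, and an optimize rewrites the live documents into a fresh segment.]

```python
class Segment:
    """Toy write-once segment: documents plus one deleted-bit per document."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.deleted = [False] * len(self.docs)

    def delete(self, position):
        self.deleted[position] = True  # flip a bit; the data stays in place

    def search(self, word):
        # Every candidate is checked against its tombstone before matching.
        return [doc for doc, gone in zip(self.docs, self.deleted)
                if not gone and word in doc.split()]

    def deleted_ratio(self):
        return sum(self.deleted) / len(self.docs) if self.docs else 0.0


def optimize(segments):
    """Rewrite all live documents into one new segment with no tombstones."""
    live = [doc for seg in segments
            for doc, gone in zip(seg.docs, seg.deleted) if not gone]
    return Segment(live)


first = Segment(["def foo", "def bar", "class foo"])
first.delete(0)
print(first.search("foo"))  # ['class foo']

merged = optimize([first, Segment(["foo baz"])])
print(merged.docs)             # ['def bar', 'class foo', 'foo baz']
print(merged.deleted_ratio())  # 0.0
```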
Andrew Cholakian: I’ll take it you’ve read the book “Introduction to Information Retrieval.”
Tim Pease: I am actually working through that book right now.
Andrew Cholakian: I’m in the same boat. I heard you use the term “information retrieval.” I figured you must have read that book.
Tim Pease: Yes. It’s a great thing.
Andrew Cholakian: For anyone listening, if you’re really interested in the guts of Lucene, that is a highly recommended book, “Introduction to Information Retrieval” will get you knee-deep in pointers, and skip-lists and everything else you need.
Tim Pease: It’s definitely a wonderful read. We read it to our kids putting them to bed and things like that. They’re going to be brilliant computer scientists.
Andrew Cholakian: I’m sure it puts them to sleep really quickly.
Tim Pease: It does. It does. Okay, going back to the optimize operation. It’s a very heavy operation because you’re reading every bit of data off disk and rewriting it to new files. You also have to have enough disk space to have essentially two entire copies of your data on disk. It’s very heavyweight. Elasticsearch can only perform that at the index level, not at the shard level. If you have an index with 30 terabytes of data, and then you add your one replica, now you have 60 terabytes of data. If you want to do an optimize, you would have to have enough space to hold 120 terabytes of data. We just did not optimize ever on that EC2 cluster. It’s been okay. I believe our deleted document count is about 10% right now.
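[Illustration, not from the interview: the disk-space arithmetic above as a small calculation, compared against the roughly 72 terabytes of storage mentioned earlier for the EC2 cluster. The function name is hypothetical, and the factor of two is the rule of thumb Tim gives, not a measured figure.]

```python
def optimize_headroom_tb(primary_tb, replicas):
    """Return (stored, peak) terabytes for a full-index optimize.

    stored: primary data plus all replica copies.
    peak:   stored data plus a second full copy while files are rewritten.
    """
    stored = primary_tb * (1 + replicas)
    peak = stored * 2
    return stored, peak


stored, peak = optimize_headroom_tb(primary_tb=30, replicas=1)
cluster_capacity_tb = 36 * 2  # 36 storage nodes with two terabytes each

print(stored, peak)                 # 60 120
print(peak <= cluster_capacity_tb)  # False: no room to optimize
```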
Andrew Cholakian: I thought that Lucene internally does expunge deletes automatically when it periodically merges the segments. Is that not correct?
Tim Pease: It is correct. If you have a large segment file (and I believe with default elasticsearch it’s five gigs) and you delete documents from it, it will not necessarily expunge those deletes. It’s basically like “okay, I’m at my limit.” The only way to do that is when you combine smaller segment files into a larger one; during that process it does expunge the deletes out of them. If you’re already at the max size, that’s not going to happen regularly. I’m sure somebody out there who knows more about this will tell me that I’m wrong. Please add comments to the podcast and let me know how it is better. That is my poor understanding of Lucene right now.
Andrew Cholakian: Just to add to that, for those getting into elasticsearch, and Lucene, and search in general, the sheer number of variables and moving parts within Lucene and elasticsearch is kind of insane. There are so many different tuning knobs and scaling features, things that at certain size require a lot of expertise to really understand.
Tim Pease: Yes, I thoroughly agree with that one.
Andrew Cholakian: That was a fantastic overview of operationally where you guys are. I was really excited to dig into the details, but moving on, I’m wondering if you guys could talk about the application side, about how you guys integrated Rails with elasticsearch over there?
Tim Pease: Grant you’ve been quiet. You want to handle this one?
Grant Rodgers: The short answer is that we have written our own client libraries and supporting code for connecting the models of Github.com to elasticsearch.
Andrew Cholakian: What is the lifecycle of a document being added and then the application code that runs after? I guess I’ll also ask what the application code lifecycle is of a search being performed.
Grant Rodgers: I’ll start with I guess a git push. When we get a git push, it goes to the file servers. We get notified of it in a queuing system, a background queue. That queue goes to the indexing job which goes to elasticsearch, like Tim mentioned, fetching the git SHA that was indexed. We’re trying to see whether we really need to index that. If we do, then we go to the file servers again, and get all the files, and index that. That all happens outside of the website framework.
Andrew Cholakian: Are there any specific provisions for failure modes in that situation? Were there any kind of unexpected surprises or architectural things you encountered while building it?
Grant Rodgers: I think Tim might be able to answer this question a little better.
Tim Pease: With indexing source code on push, it’s a self-healing process. We have that repository document which keeps track of the last indexed commit. If we just happen to miss three commits where those jobs fail, then on the next commit that comes in, we’re still looking at the diff between the previous commit that we indexed and the one that we’re seeing with this new push. You do a git diff and you get all the files that have been updated, deleted, or added. You can just say, “Okay, we need to remove these files. We need to add these files, and all that.” It’s self-healing and that’s the approach that we have taken with pretty much all of the architecture.
The two big things are: everything goes into a queue. We don’t do any indexing within the Rails app itself. And when we do index a document, it will always try to get it up to the most recent version and have these marks in there to do that. For things like repositories, or issues, or pull requests, those are also documents in separate indexes. If somebody comments on an issue, we actually queue up a job in our Resque system and then that job gets picked up by the worker. We go and look at the issue object and we construct the document. Then we store that document in elasticsearch. Or, if an issue has been deleted by the user, we then go and find that issue and go ahead and delete it by ID from the search index.
Andrew Cholakian: Are you guys using the HTTP transport or the memcached or Thrift ones for talking to your cluster?
Tim Pease: Everything is HTTP.
Andrew Cholakian: I’m curious what your perspective is on this. There are three different transports elasticsearch supports. You guys obviously have quite a bit of scale, and it’s interesting that you guys use HTTP. The ease of use of HTTP, in terms of both client familiarity and network infrastructure in general, is a key asset. I wonder how many people actually use the other transports given how well HTTP performs.
Grant Rodgers: At one of my previous jobs, we actually did use the Thrift transport briefly. It was a significant performance increase. Like you said, the operational burden of that one was significant. We eventually had to switch back to HTTP.
Andrew Cholakian: As far as the operational part, for me the reason I don’t use the thrift protocol is because we can use HTTP load balancers in front of our query handling nodes. That way we can just use a very simple pattern where if one of those goes down, then we’re still okay. What were the operational concerns that you guys had in switching over at that point?
Grant Rodgers: In our client at that time, we had distributed connections in the clients, so we didn’t need load balancers. One of the biggest problems was that it was difficult to diagnose what was going on in a Thrift connection, just because the errors are not very obvious. It’s a lot harder to hit it outside of your client libraries. When you have an HTTP error, you can replicate that error on the command line with curl, but there’s no Thrift analog to curl. You can only go through client libraries. It makes it difficult to diagnose where the problems are.
Andrew Cholakian: Yeah, I completely agree. I guess this is why JSON is such a popular format too. The tooling is so fantastic right now, it works even in areas where it isn’t perfect. To go back to the cluster stuff for a minute, one question I forgot to ask you guys is what are your key health metrics? What are the things that you look at to say, “Okay, things are going well,” or “I should maybe be worried at this point?”
Tim Pease: That is an excellent question. I’ll answer first and then I’ll let Grant answer as well. The three that popped up in my head are overall load on the machine, the amount of JVM heap that’s currently being used, and also the query response times. With all three of those metrics, we have them graphed, and we’re collecting data, sending it off to Graphite. If we start seeing spikes in query response times, that’s a good warning there. If we look at the machines and they’re under high load, that’s another good warning. Then we look at the frequency of major JVM garbage collection events and that is yet another good indicator that something is amiss.
Andrew Cholakian: If there are too many GCs, that means you’re probably close to your heap max. Do you guys use a specific GC or just the one that, I forget which one, elasticsearch defaults to?
Grant Rodgers: Yeah. We’re using the concurrent mark sweep with ParNew.
Andrew Cholakian: Is that the one that’s multithreaded and close to pause-less?
Grant Rodgers: It’s close to pause-less. Yeah, that’s the idea anyway. I think it’s the default.
Tim Pease: It is. Grant what are the metrics that you keep an eye on?
Grant Rodgers: Those are the three. I think those are the most important ones. There are a lot of others, like you mentioned the ratio of deleted documents and the number of open contexts, and things like that. You only really have to look at those if there’s a specific problem with a particular piece.
Tim Pease: To investigate problems, the slow query log is an invaluable resource there. We’ve used that quite a bit to find just odd, weird queries and go ahead and fix them at one layer or another. I will say, can we go back to the whole managing of the clusters and indexes? There’s something that we totally forgot to talk about.
Andrew Cholakian: Of course.
Tim Pease: When we are talking about optimize, it’s only performed at the index level. If you can structure your indexes to actually split your data across multiple indexes, that is a good thing to do. The easiest way to do that is time slicing your indexes. For example, with our audit logs, because we’re just always writing today’s data, what we do is each month we create a new index for that audit log data. The previous indexes can be put into a read-only mode, so we can no longer write to them. Then we can do a full optimize. We just compress everything down to one segment. That gives us very performant queries. If you can do tricks like that, it’s definitely a good thing.
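[Illustration, not from the interview: a minimal sketch of month-based index naming for time-sliced data. The `audit_log-YYYY-MM` naming scheme and both function names are assumptions; Github's actual index names are not given.]

```python
from datetime import date


def audit_index_for(day, prefix="audit_log"):
    """Name of the monthly index that a document written on `day` goes to."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def read_only_indexes(existing, today, prefix="audit_log"):
    """Indexes for earlier months: no longer written, safe to optimize."""
    current = audit_index_for(today, prefix)
    # Zero-padded names sort chronologically, so string comparison works.
    return sorted(name for name in existing
                  if name.startswith(prefix + "-") and name < current)


today = date(2013, 9, 3)
existing = ["audit_log-2013-07", "audit_log-2013-08", "audit_log-2013-09"]

print(audit_index_for(today))              # audit_log-2013-09
print(read_only_indexes(existing, today))  # ['audit_log-2013-07', 'audit_log-2013-08']
```

[Each closed month is a small, self-contained index, so a full optimize only ever needs headroom for one month of data.]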
The other big trick that we use, because anyone can push to any repository at any time, we can’t really time slice our source code indexes. Instead, we use the routing parameter based on the repository ID. That allows us to put all the source code for a single repository on one shard. We have the concept of global search when you go to Github.com/search, those searches get spread across every shard in that source code index. However if you’re on just a single repository page, and you do a search there, that search actually hits just one shard. Those queries are about, I want to say, an order of magnitude faster, but that would be a lie. It’s about twice as fast.
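[Illustration, not from the interview: a minimal sketch of why a routing value lets a repository-scoped search touch a single shard. CRC32 is used here only as a stand-in hash; elasticsearch uses its own hash function internally, and the function names are hypothetical. The shard count of 500 is the figure mentioned earlier for the EC2 index.]

```python
import zlib

NUM_SHARDS = 500  # shard count mentioned earlier for the EC2 index


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value (here, a repository ID) to one shard number."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


def shards_to_query(repository_id=None, num_shards=NUM_SHARDS):
    """Global search fans out to every shard; repository search hits one."""
    if repository_id is None:
        return list(range(num_shards))
    return [shard_for(repository_id, num_shards)]


print(len(shards_to_query()))                  # 500
print(len(shards_to_query(repository_id=42)))  # 1
```

[The same routing value has to be supplied at index time and at search time, otherwise the search looks on the wrong shard.]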
Andrew Cholakian: I would assume the load might actually be an order of magnitude or even more. The latency probably isn’t. I was actually going to ask you about that, because in every elasticsearch at scale thing I’ve seen, most highly recommend routing.
Tim Pease: Yes, yes. Routing will save your bacon. The other thing is if you really can just get more indexes, and if you can set them up so you only write to them for a certain amount of time, and then you just archive those off, it’s really, really good.
Andrew Cholakian: This is interesting because, from the perspective of Lucene, just to recap for anyone who’s not intimately familiar with this, a segment is a shard is an index essentially in elasticsearch. At a certain Lucene level, that’s kind of true, but from an operational concern standpoint, that’s definitely not true. Indexes can be composed together in the same way that indexes are composed of shards, and in the same way that shards are composed of segments. I guess what you were talking about before with optimize is that the operational granularity really doesn’t come into play except at the index level.
Tim Pease: Correct.
Andrew Cholakian: From a search results quality perspective, I guess you guys didn’t really iterate very much in terms of writing the query, because you pretty much opened the query up to users to do whatever they want. Is that correct, because it’s just the Lucene query string?
Tim Pease: Yes and no. We’re selective about what fields you can query. With code search there’s only one text field in there, but at least with repositories, we have the description of the repository that people type in, and the name of the repository. We also index repository README files and so they’re searchable as well. You can actually target those individual fields from those query boxes to find what you’re looking for.
When we’re querying across multiple fields using the Lucene query parser, we’re actually doing a dis_max query. We’ve applied different weights to the various fields based on how important they are, at least in our minds. Those are some of the tunings that we’ve done to get better search results, and better is always subjective.
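[Illustration, not from the interview: a minimal sketch of a dis_max query that runs the same query string against several fields with different boosts. The field names and weights are made up for this example; the weights Github actually uses are not stated.]

```python
import json


def repo_search_query(text, field_weights):
    """Build a dis_max query: one boosted query_string clause per field.

    dis_max scores a document by its best-matching clause rather than
    by the sum of all clauses.
    """
    return {
        "query": {
            "dis_max": {
                "queries": [
                    {
                        "query_string": {
                            "query": text,
                            "default_field": field,
                            "default_operator": "AND",
                            "boost": weight,
                        }
                    }
                    for field, weight in field_weights.items()
                ]
            }
        }
    }


# Hypothetical weights: a name match counts most, the README least.
weights = {"name": 3.0, "description": 2.0, "readme": 1.0}
query = repo_search_query("elasticsearch client", weights)
print(json.dumps(query, indent=2))
```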
Andrew Cholakian: Yes. That’s the funny thing. It’s always subjective.
Tim Pease: It is always subjective. It’s really fun. We keep an eye on twitter, so we watch when people comment about Github search. One tweet will come and say, “Github search is horrible, I can never find anything.” Then the very next tweet that we see is, “Github search is wonderful. It’s exactly what I’m looking for.” It’s very humorous.
Andrew Cholakian: What advice do you guys have for people who are new to elasticsearch for getting started? What’s your advice in terms of getting up and going?
Tim Pease: That’s a hard question. Plan on rewriting everything.
Andrew Cholakian: Ohhhhh, I did that.
Tim Pease: Yes. We started down one route and we learned like, “oh, this stuff is working, this doesn’t work.” Definitely plan on redoing it at some point in time. Get as close to the elasticsearch raw interface as you can. If you’re using the HTTP API, go ahead and construct those JSON query documents by hand for the first go around.
You’ll learn a lot about elasticsearch, because that’s really what you want to learn. You don’t want to have to learn a lot about the client library that you’re using. Find one that’s just, like, raw, and just gives you the raw endpoints, where you can just give it a JSON document and it will just return a JSON document to you.
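A minimal sketch of what staying close to the raw interface can look like, using only the Python standard library: the query is a hand-written JSON document, and the code does little more than POST it to an index's _search endpoint. The host http://localhost:9200, the index name repositories, and the field names inside the query are placeholders assumed for the example.

```python
import json
import urllib.request

# A JSON query document written by hand. The field names are hypothetical.
RAW_QUERY = """
{
  "query": {
    "query_string": {
      "query": "description:search AND name:elastic*"
    }
  },
  "size": 10
}
"""


def build_search_request(base_url, index, raw_json):
    """Wrap a hand-written JSON query in a POST request to <base_url>/<index>/_search."""
    json.loads(raw_json)  # fail early if the hand-written JSON is malformed
    url = "{}/{}/_search".format(base_url.rstrip("/"), index)
    return urllib.request.Request(
        url,
        data=raw_json.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and return the decoded JSON response (needs a reachable server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; run_search(request) would.
request = build_search_request("http://localhost:9200", "repositories", RAW_QUERY)
```

Because nothing sits between the JSON you wrote and the JSON that comes back, any surprise in the results points at the query or the mapping rather than at a wrapper library.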
Andrew Cholakian: I’m going to echo that actually, just because I don’t think that point can be said strongly enough. The reason I agree with that is because I did exactly the wrong thing. There’s a lot of ways to get an out of the box client that just works with your framework, just integrates everything and makes it searchable.
What you wind up realizing is that you have a very strange learning curve. You have a lot of wins early on. All of a sudden you have searchable data. Then you just hit this wall where you don’t know what’s happening anymore. Where you want to extend it or improve it, and you have no idea what’s going on, and you have to kind of start from the very beginning, and you actually lose all those early gains. I completely agree. Use the JSON API until you actually know what’s going on.
Grant Rodgers: Yeah, I would even recommend for everybody to write their own elasticsearch client, because it’s so easy. It would take you maybe a few days and you learn so much about how the API works and how queries work and all that. It’s a really good learning experience. You don’t have to actually use the client you wrote, because there are better ones out there, but I think that’s the best way to learn.
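To give a sense of how small a learning-exercise client can be, here is a hedged sketch of one. The class name MiniClient, its method names, and the pluggable transport function are all invented for this example, and the URL layout (index/type/id for documents, index/_search for queries) is an assumption that differs between elasticsearch versions. The transport is injectable so the client can be exercised without a running cluster.

```python
import json
import urllib.request


def http_transport(method, url, body=None):
    """Send one HTTP request and decode the JSON response (standard library only)."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


class MiniClient:
    """A deliberately thin client: JSON-style dicts in, JSON-style dicts out."""

    def __init__(self, base_url, transport=http_transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def _url(self, *parts):
        return "/".join([self.base_url] + [str(part) for part in parts])

    def index(self, index, doc_id, document, doc_type="_doc"):
        # Assumed path layout; older versions use a named type in place of "_doc".
        return self.transport("PUT", self._url(index, doc_type, doc_id), document)

    def search(self, index, query):
        return self.transport("POST", self._url(index, "_search"), query)


# Exercise the client with a fake transport that records calls instead of
# talking to a server.
calls = []


def fake_transport(method, url, body=None):
    calls.append((method, url, body))
    return {"acknowledged": True}  # canned reply, not a real server response


client = MiniClient("http://localhost:9200/", transport=fake_transport)
client.index("repositories", 1, {"name": "example"})
client.search("repositories", {"query": {"match_all": {}}})
```

Keeping the transport separate from the URL-building logic is a design choice for testability; a real client would also need error handling, retries, and connection management.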
Andrew Cholakian: I completely agree. As the author of a client, it was a huge learning experience for me. I guess that kind of wraps up the questions I have for you guys. Is there anything either of you would like to add before we wrap this up?
Tim Pease: I know we just said write your own client, do everything as close to the metal as possible, but at Github we’re in this position where we have the Github.com search, but we also have a lot of internal tools which are using search. Currently, we’re just two guys. It’s just me and Grant Rodgers here. We can’t support everything. Other people in the company have stepped up, and started using it, and they ask us lots of questions.
We’re spreading it all around, which is wonderful, but I would love to be able to hand them something that says, “Here, you can use this and it will help you set up your indexes. It will help you write your queries. It will help you do all these things.” It’s such a hard problem, because the way you structure the index and the way you structure the queries is very domain-dependent. It’s very, very sensitive to what you’re actually trying to accomplish.
I get jealous of things like Active Record, which is based on relational algebra, even though it’s ... I won’t say it’s simple to do, but it’s based on these understood mathematical principles. Whereas I think with search, there’s a lot more of a creative element to it in that you have to understand the domain that you’re trying to search and exactly what you’re going to get out of it. That heavily influences how you structure your queries, how you structure your documents and things like that.
Andrew Cholakian: I completely agree. You’ll oftentimes start down a certain path and you’ll make incremental improvements. Then you’ll discover that your entire approach to a search problem is wrong. That’s something that almost never happens in SQL, unless you have huge query optimization problems, which is close to never.
Tim Pease: It’s also wonderful when you realize that the analyzer you’ve used to process your 30 terabytes of data had a bug in it and it wasn’t giving you the right search results.
I’m not going to point fingers at anyone on this phone call who did that, *cough* Tim. Actually moving to this new hardware has been a blessing in disguise, because now we can fix all the mistakes that we made in the EC2 configuration stuff.
Soon, everyone, I promise. Soon.
Andrew Cholakian: Grant, is there anything you wanted to add to the conversation before we terminate the recording?
Grant Rodgers: Only that elasticsearch is really good. There’s no way we could do any of this stuff without that piece of software first, so thank you Shay Banon and all the elasticsearch folks for producing such a great piece of software.
Tim Pease: I will say that they are a wonderful group to work with. It has been absolutely delightful getting their input and help on so many of these problems we have faced. There’s no way we could maintain this scale of data without their help, so thank you.
Andrew Cholakian: Well, thank you both. Thank you Tim, thank you Grant. I guess this will show up on our site soon.