Interview with Nick Zadrozny of Bonsai.io
Check out Bonsai.io for more info about hosted elasticsearch.
Andrew: Nick is one of the founders of One More Cloud. They host Solr and elasticsearch instances for you to use. Nick, I’m wondering if you could walk me through what bonsai.io is and how you guys started that?
Nick: Yeah, I definitely will. Bonsai is actually our more recent service. We cut our teeth on Solr with websolr.com, which I believe is one of the very first of its kind as far as hosted search as a service. That was back in about 2009. I had just been wrapping up about five years of freelance web development using Ruby on Rails, with a couple years of Java development before that, and ran into an old college friend of mine at a Ruby conference in San Francisco. He shared with me his current project, which was starting this Solr hosting service.
He had been tapped by some of the founders of Heroku, who knew he had some search engine experience, and they were just getting started with this add-on thing. They basically dropped this idea in his lap like, “You should build a search engine for us. We’re getting this add-on thing. It’s going to be pretty cool.” It was a great case of right place, right time, right skills, right interest, and so I joined up with him as a co-founder, and in early 2010 we started charging for Websolr. We bootstrapped into a real business and we’ve been running it for three years and some change now, almost four years since the first commit in our code base. We got to be pretty good at scaling [the] Lucene search engine, Java distributed systems, all that kind of stuff. I’ve been pretty heavily involved in search.
I maintain Sunspot, one of the popular Ruby clients for Solr... or at least help to maintain it. I’d love to find some more maintainers if anyone...
Andrew: Out there who’s interested?
Nick: Hehe, yeah, the plight of the open source developer, I know, right? When I learned of elasticsearch, I actually learned about it from the creator of Sunspot, Mat Brown, who was using elasticsearch in a newer startup he was a part of, and he just went on about how awesome it was.
I went through and looked at its documentation and its API and really liked what I saw. That was about November of 2011; elasticsearch was on my radar. A few months later we decided to pull the trigger and actually build another hosted service for elasticsearch, launched that into a public beta in mid to late 2012, and launched it into production in January of this year.
It’s been really cool to see the uptake on that. It’s actually been a really nice experience for me and for our business to launch a sister service for something we already had. You don’t often get an opportunity to go back to basics and rethink everything afresh. Oftentimes when you try to do that it turns into this big rewrite that flops. In our case it was an opportunity to do the great rewrite, and actually it’s gone really well.
Andrew: That’s fantastic. I’ve just realized, by the way, that you maintain Sunspot. I actually gave you guys a patch that horribly broke Sunspot on Rails.
Nick: Well, Sunspot is well behind on its patches and pull requests, and I need to actually carve out some time. Maybe I’ll have you fix that as your penance.
Andrew: Oh, well, that was fixed a year ago, but that would be appropriate. In the search space, elasticsearch has obviously risen really quickly. I’m wondering, as a business decision, what did you guys look at when you decided to start supporting elasticsearch alongside Solr, which is obviously the incumbent in the space?
Nick: I mean, it was a pretty natural fit in terms of technologies. A lot of our knowledge scaling Lucene and Java just dovetails really nicely into elasticsearch, and also in terms of the usage.
You’d be surprised how many people ask me questions about elasticsearch. I’ll have no idea what the syntax is in elasticsearch, but I’ll know exactly what the feature should be and how it should work. So it was just a really natural fit there. A lot of our customers are themselves developers, and moving around on different projects they’ve gotten wind of elasticsearch one way or another and they’re interested in it.
They actually come to me and say, “You guys do Solr, you should do elasticsearch as well, because I want to use elasticsearch but I don’t want to run the servers myself. I already use Websolr and love it, so there’s a fit and you guys should do that.”
It was mostly our customers telling us that they wanted it, and that was enough for me as far as the business decision goes. Justifying the rest of it from there... they’re pretty similar. They share a lot of the same DNA. I don’t know, maybe the elasticsearch developers would cringe to hear me say that; I’ll say some things about elasticsearch later on to make up for it. From where I’m sitting, running the servers and managing all of that, we’re able to leverage a lot of our existing knowledge and skills and not have to reinvent the wheel in a lot of places.
Andrew: You said that a lot of your customers are Websolr customers who want to start using elasticsearch. This is a good chance to ask you about elasticsearch: what is making these Solr users want to pick up elasticsearch as a successor?
Nick: I wouldn’t say most of bonsai.io’s customers are also Websolr customers; [it’s] just that there was enough of an overlap of those developers to make it interesting to us. I would actually say most of our customers on Bonsai are people who are brand new to search. I think it’s kind of a process where you’re vaguely aware that there’s a better way to do search than just your database.
I think we have the whole NoSQL movement to thank for helping us as developers think outside the database box. There are a lot of great tools out there for doing things with your data in different ways, and I think people have this kind of awareness of search engines. When you go out and look around and try to decide what to do, there’s a handful of search engines out there, but Lucene has a pretty strong showing, and rightly so, because I think Lucene itself is definitely the open source industry standard for doing search.
But then from there you take Solr, you take elasticsearch, and you compare them side by side. Just look at some of the starter examples of how to use them. I think elasticsearch has a much nicer developer API and developer experience. Solr has a lot more history and a lot more legacy, a bit more fragmentation in terms of the design of its API. There are nice bits of it, but they’re not readily apparent. That learning curve, I think, is a lot easier to communicate with elasticsearch than it is with Solr.
I think that’s a large driver of why people are interested in elasticsearch: the fact that it’s younger, it has a fresher API. It had a benevolent dictator in the form of Shay [Banon], who got a chance to reinvent this API from scratch and went with a really nice, cohesive, RESTful JSON thing. I think that really drives a lot of our customer adoption. We’re in a market where we have a lot of startup customers and developers, freelancers going from one project to the next; they’re pretty interested in staying up to date with the latest and greatest, and really willing to give something new a try. I think we’ve seen a lot of that.
Andrew: You mentioned that a lot of your customers are people who are new to search. Being a hosted service provider must make for an interesting experience, because I assume that makes it a high touch business to be in when a lot of people are learning the technology while using you guys as a host. That would put you in a good position, I would think, to talk about the roadblocks people hit using elasticsearch?
Nick: That’s a great question. We’re definitely in that position; it’s fairly high touch. I think the community as a whole for elasticsearch is pretty welcoming, though. It’s in a phase where everyone is sort of learning together, so there’s a great mailing list. The Elasticsearch, Inc. training sessions, from what I hear, are just phenomenal; they cover the whole thing top to bottom. For anyone who has a chance to actually go to those, I hear they’re a really great value. There are plenty of resources for people to learn.
I think a lot of people’s experience of any search engine is actually going to be based on the client that they’re using to integrate with their application. That’s true whether it’s Solr or elasticsearch, and so I actually do find a lot of people choosing their search engine based on their client and the quality of the client’s documentation, and how well its configuration or DSLs or functions or whatever line up with their own mental model.
Andrew: That’s a really interesting thing you brought up about client integration. I actually have some questions for you about that, because the client integration was something that really drove my initial experience with elasticsearch, so I definitely understand what you’re saying.
I think, maybe I could just tell you a little about the changes I saw in how I viewed search in relation to an app, and I’m curious to hear what you think of that story in terms of what your customers see and how you see things.
When I picked up elasticsearch as a Rails developer, we needed to add search to our app. The primary way I thought about it was that we have these models, which in Rails equate one to one with a table in the database, and we need to make these things searchable. I [thought, I] need something that just takes a model and makes a searchable version of it in this elasticsearch thing, and then I’ll start searching it. I just wanted something that says, “Attach this model to elasticsearch,” and then we’ll start searching and go from there. That was an okay starting point, but the further along we got and the more I learned about search, the harder our problems got. We realized that the one to one mapping between models and a searchable unit in our elasticsearch cluster was just not really playing out.
Our model in elasticsearch [now is] a very denormalized version of our stuff in the database. We wound up stripping out almost all of the direct ActiveRecord integration because we had a lot of weird things going on where things weren’t mapped one to one. We have a lot of triggers and things like that for when things get updated.
Basically, I feel like there’s a spectrum of use cases between people who just want to take a simple little thing and make it searchable, and people who have a search problem that’s a lot more complicated than “I have titles on my blog posts and I want them to be searchable.”
I guess the question for you is, do you see a lot of the developers on your service going through a similar process of weird denormalization and restructuring their thinking of their apps around search, rather than search as an add-on?
I don’t know if that’s a vague question but that’s something I’ve been thinking about for quite a while.
Nick: That’s a great question. I think there are a lot of those questions [that] get rolled up together. I think people go with the defaults of their clients until it starts causing pain. The pain points I see are, you know, someone’s reindexed and now they’ve got an index that’s really huge, because no one has ever really explained to them the concept of sharding, where you split up your data across multiple nodes in a cluster. So they’ll end up with just one gigantic index where the shards are very much out of proportion to what you would otherwise like to see given the capacity of their cluster.
So, like, “how do I design my sharding strategy?” I get a lot of questions of “how many indexes do I need?” These are abstract concepts, and so I find myself explaining a lot of them by relating them back to the database. I typically advise people that an index is analogous to a SQL database. Elasticsearch is really flexible, so you can use indexes for a lot more than that; for example, Tire uses an index per model.
Or you can use your indexes in other cool ways in elasticsearch, like a temporal sharding strategy, where you have a timestamp within your index name and group them all up under an alias.
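To sketch the temporal strategy Nick describes: each period gets its own index, and one alias groups them for searching. The `events` prefix and monthly granularity here are made-up illustrations, but the request body follows the shape of elasticsearch’s `_aliases` endpoint.

```python
from datetime import date

def monthly_index_name(prefix, day):
    """Build a time-based index name, e.g. 'events-2013.09'."""
    return "%s-%04d.%02d" % (prefix, day.year, day.month)

def alias_actions(prefix, months):
    """Build a body for the _aliases endpoint that points one search
    alias at every monthly index, so queries can hit them all at once."""
    return {
        "actions": [
            {"add": {"index": monthly_index_name(prefix, m), "alias": prefix}}
            for m in months
        ]
    }

body = alias_actions("events", [date(2013, 8, 1), date(2013, 9, 1)])
```

With this layout you index into the current month’s index but search against the alias, and old months can be dropped wholesale by deleting one index.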
There are a lot of these questions of “how do I map the ideas of a search engine to the ideas of a database that I’m already familiar with?” that I end up talking through with a lot of people, too. The deeper side of that is when you talk about which fields to index and what exactly you’re trying to search. I always go back to the tried and true approach of encouraging people to set the search engine aside for a second and just design the search results page that they want.
Really think about this from the perspective of the user: look at the mockups they came up with and think about what is on this page. Like, what am I searching? You’re going to want to wind up indexing the thing that you search. That can sometimes be a little unclear when you have these hierarchies of associations. Do you want the NGO, or do you want all of the positions that the NGO is hiring for?
Andrew: What do you mean by NGO?
Nick: Sorry, a nonprofit organization.
Andrew: I was curious. I wasn’t sure which just [inaudible 00:16:36] we’re talking about.
Nick: Just pulling out a random customer example there. Questions like that, you know: going through the design, asking what’s the user thinking, what’s the user looking for. That’s going to help clarify some of that.
Also, getting into general recommendations: first of all, Lucene has a flat document design, where you have an index, the index has many documents, and that’s it. There are no relationships between tables like you have in a database. If you have a more than trivial search case and you have associated objects and you want to pull in values from one or the other, you’re going to have to get comfortable with the idea of denormalization.
This is another idea that I think developers are more comfortable with given the movement towards NoSQL databases over the last few years, because it’s not an uncommon feature of, say, a key-value store. I think the real trick is setting up your indexing and your denormalization in such a way that it fits your problem domain. In some cases that means abstracting your search logic out a little bit more than the tight one-index-per-table setup, or whatever you’re getting started with by default.
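A small sketch of what that denormalization looks like, using the NGO/positions example from the conversation. The model names and fields here are hypothetical; the point is that each searchable unit is one flat document that carries copies of its parent’s fields instead of a join.

```python
def position_to_doc(position, ngo):
    """Flatten one job position plus its parent organization into a
    single Lucene-style document: no joins, parent fields copied in."""
    return {
        "title": position["title"],
        "description": position["description"],
        "ngo_name": ngo["name"],        # denormalized from the parent
        "ngo_country": ngo["country"],  # duplicated on every position
    }

ngo = {"name": "Water for All", "country": "KE"}
positions = [
    {"title": "Field Engineer", "description": "Wells and pumps"},
    {"title": "Accountant", "description": "Grant reporting"},
]
docs = [position_to_doc(p, ngo) for p in positions]
```

The cost is that when the parent record changes you must reindex every document that copied its fields, which is exactly why triggers on updates, as Andrew mentions, tend to appear in these integrations.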
Andrew: That’s a really good point. I think when I initially approached elasticsearch, I tried to create a schema in elasticsearch that was very similar to our relational one, with a very complex mapping of objects with nested objects. It was completely irrelevant to our actual search results and a colossal waste of time with nothing to show for it. That’s a really welcome point, and a good thing for newer elasticsearch developers to hear.
Nick: Yeah, on nested objects, I would actually say: when in doubt, denormalize, don’t nest, because nesting doesn’t exist in Lucene. You’re actually delegating that logic to elasticsearch, which is doing some denormalization for you. It’s better to be a little bit in control, unless you understand the whole nested document thing pretty well and understand what it’s doing.
Andrew: I would agree completely with that; that’s very, very good advice. Moving on a little bit: elasticsearch is also interesting in that it’s seen adoption in a way I don’t think I’ve really seen Solr adopted, which is as a general purpose data store, where people are using it not really for its full text search capability but just as a place to put denormalized data.
You mentioned how elasticsearch is part of the NoSQL movement. I’m wondering whether bonsai.io has experience with these kinds of customers, and what your thoughts are on elasticsearch as a general purpose data store.
Nick: That’s a great question. I think that it is a noble goal. Certainly it would help me out a lot as a business for people to use elasticsearch as a primary data store. But I think that’s a really tricky goal to implement with Lucene if you are allowing yourself the ability to go back and make updates, just because Lucene requires that you update the entire document.
When we talk about updating one field at a time, the search engine itself has to do more work to fetch back the existing values and reindex the entire document. There’s just a lot of statefulness that happens and a lot of edge cases that you don’t really consider that go into making elasticsearch, or any Lucene-based engine, into an actually worthwhile primary store.
That’s a really hard recommendation for me to make. I actually don’t generally recommend people treat either it or Solr as a primary store, just because, by the time your data gets turned into documents in a Lucene index, there’s been so much slicing and dicing of your fields and their values.
I always try to go back to this example of “what is an index?” An index is not a new concept. If you pick up a book and you go to the back of that book, it’s got a list of interesting terms and the pages where they occur in that book. That’s really what we’re working with when we talk about an index. It would be hard to make a case for the index in the back of your book being an accurate representation of the original copy of your data. The way you get around that is to store the entire raw JSON that gets pushed in, and that’s what elasticsearch does: it ends up building a data store on top of Lucene, which is helpful.
I think that elasticsearch would like to become a good primary store, and I think it has that potential. I personally may be that sort of grizzled war veteran who’s seen too many instances of index corruption because something weird happened. Like I said, it’s the same for any data store: you need to be sure you’re really backing up the data.
Lucene itself, Solr, and elasticsearch all come from this tradition where they’re building a very smart secondary index onto a primary data source. And I think because of that they’re able to eschew a lot of the standard database features. This is the reason why Solr and elasticsearch and search engines in general, I think, are so much better at doing full text search than a database: in a database, there’s a lot of work and a lot of concern with, you know, transactions and maintaining your database clients and all those ACID principles.
They invest a lot of effort into that stuff to the detriment of being able to really search. There are tradeoffs. I think indexing the source document can be a nice convenience helper, like if you want to reindex your data and not have to jump through many hoops.
Andrew: Given your maybe less-than-full trust in... as you mentioned, you had some issues with index corruption and things like that in the past. I agree; I mean, at Pose we don’t use elasticsearch as the primary store for anything. We can recreate the whole index at any time of our choosing. What kind of position does that put you in as a hosting and service provider, and what do you guys do for data durability?
As I understand it, the only real way to do true elasticsearch backups is to pretty much tell the cluster to start buffering writes in memory instead of running them through to disk, and then snapshot the index files on disk. Unless I’m mistaken, but that’s not an area I’ve dug too deeply into.
What do you guys do as far as durability goes, and what would you say a good expectation is as far as durability with elasticsearch?
Nick: That’s a great question. As far as the primary/secondary data thing goes, I would always just keep that in your back pocket: make sure you’re storing your data somewhere else that’s not in elasticsearch. Even if you’re going to mostly use elasticsearch as a primary data store, at least pipe all of your data into, I don’t know, Hadoop or Cassandra or something; just archive it.
Whatever the purest form you get it in, dump it somewhere, so that if worst comes to worst you can rebuild your cluster. Then from there, maybe your application is mostly just interacting with elasticsearch. Those kinds of use cases would be fine.
As far as durability goes, I think having a good sharding strategy and relying on elasticsearch replication is definitely your first line of defense.
Andrew: Nick are you still with me?
Nick: Sorry about that. Somehow I hit the mute button accidentally. Where did you lose me?
Andrew: You said the sharding and replication were a good first line of defense I believe.
Nick: Yup! Sharding and replication are a good first line of defense. That’s because elasticsearch maintains enough state when it’s running normally that if you lose a node, or something happens with an index or a shard on one machine, it can just rebuild from another copy. That’s a great first line of defense, and it’s a huge selling factor when you’re going from staging or internal use to actual live production. When you’re live and in production, you definitely want replication as your first line of defense for maintaining data integrity.
Beyond that, having a good backup strategy is pretty key. For the most part you can do pretty well with rsyncing the actual data on the disk. We do disk-based snapshots and rsync offsite. We’re pretty comfortable with that.
Andrew: That brings me to my next question, which is, I would assume you guys operate one of the larger elasticsearch clusters out there. It sounds like you’ve had an interesting learning curve, being one of the earlier large scale users of elasticsearch.
I’m wondering if you could share some scaling tips from your experiences?
Nick: Yeah. That’s a great question. Well, scaling tips besides “use us and pay us to scale for you”
Andrew: If you’ve got some kind of brain problem where you can’t do that...
Nick: The classic case there is people who aren’t able to do cloud based hosting for whatever reason, so they go on-premise. My general advice is that your ability to index efficiently is really useful, because, first of all, you’re going to have to populate your index the first time. You’re going to have to update it on an ongoing basis. In case of a catastrophe you’re going to have to wipe the whole thing and rebuild it. Getting good at indexing your data is worth investing in. The biggest tip I have to offer there is that indexing one document at a time is really inefficient.
People always ask me, “What sort of batch size do I use? Do I want to use highly parallel small updates, or serial updates with large batches?” I say, “When in doubt, go with the large batches,” because single document updates are the worst case scenario for the JVM garbage collector. You give it your request with your document. It parses that all out into a bunch of objects in memory. As you alluded to earlier, elasticsearch will hold on to that for a little bit until it’s gotten a batch that it will then flush out to disk, creating a new segment in Lucene.
You’ve got these objects hanging around in memory for a while that the garbage collector can’t touch. If you abuse that, you’re going to get into a situation where your garbage collector is going crazy and pausing your process in order to clean up the last couple of minutes’ worth of memory.
That’s been one of the classics; we still see those situations to this day, both on Websolr and on elasticsearch, where someone’s garbage collector starts going crazy because someone decided to reindex with a thousand workers in parallel, each sending in one update at a time. That said, when you’re reindexing and driving that from your application, it often really works to your benefit to do the reads in parallel, because you’re reading out of a database and serializing into some format, you know, JSON or XML or whatever. That work really benefits from being done highly in parallel. It’s the classic mismatch between what you can do in read volume versus write volume.
It definitely helps to collect all that data in massive parallel and prepare it for elasticsearch. On the elasticsearch side, you really want to aim for the convention of one indexing process per primary shard, essentially, something on that order of magnitude anyway; one process per CPU core in your cluster, if you have that sort of awareness.
Andrew: So if I have a cluster with ten cores in it and I’m trying to pump data into it, I want the external application to... we want to have about ten processes running there.
Nick: Exactly, and you want them to be sending in batches of anywhere from a few hundred to a few thousand documents, depending on your volume. This kind of stuff takes some tweaking and some testing, but even just that advice alone puts people in the right neighborhood.
If you’re in doubt and at a relatively small scale, I would definitely encourage you to figure out a way to do your reindexing from one or two processes. Pull the JSON out of a queue somewhere and do it that way.
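To sketch the batching approach described above: collect documents into batches of a few hundred to a few thousand and format each batch in elasticsearch’s newline-delimited `_bulk` format (an action line followed by a source line per document). The index and type names here are hypothetical, and actually sending the payloads is left out so the sketch stays self-contained.

```python
import json

def bulk_payloads(docs, index, doc_type, batch_size=500):
    """Yield one newline-delimited _bulk request body per batch of docs."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            # Action metadata line, then the document source line.
            lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
            lines.append(json.dumps(doc))
        yield "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

docs = [{"title": "doc %d" % i} for i in range(1200)]
payloads = list(bulk_payloads(docs, "posts", "post", batch_size=500))
# 1200 docs at 500 per batch -> 3 payloads (500, 500, 200)
```

Each payload would then be POSTed to the cluster’s `_bulk` endpoint; running roughly one such sender per primary shard or CPU core matches the parallelism Nick suggests.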
Andrew: I would agree with that advice myself. The batching aspect of it is also important, I think, because I would argue that one good thing to strive for in the architecture is to decouple your app as much as possible from elasticsearch, unless it’s necessary in terms of the request lifecycle. Having queues like you mentioned, and then building up batches of requests, is usually a part of that architecture anyway.
Moving on, one of the questions I have for you as a company operating at scale: I wonder if you’d mind letting us know some of the secrets you guys have as far as how your cluster is different from a standard apt-get install elasticsearch, or dpkg -i elasticsearch.deb. What kinds of things do you do to the Bonsai cluster that are only required for people at your scale?
Nick: Some of the differences are the classic, tune things for production. We have some settings related to the garbage collector and memory allocation that I think are unique that probably wouldn’t apply to most installations unless you’re operating at our scale.
Andrew: Which GC are you using for the JVM?
Nick: You know what, it’s my business partner Kyle who’s the Java expert, so he would be the one to ask about that kind of stuff. I just know that he’s experimented with a lot of different JVM tuning settings over the years. He actually spent two years at Twitter doing distributed databases and Java distributed systems, so we benefit a lot from his experience there. We definitely tuned that stuff quite a bit. It’s not stock JVM GC settings.
I think the general recommendations of give elasticsearch half the system memory and using their defaults should work pretty well for most people.
Andrew: That’s actually a good general point: usually when you wind up tuning settings, it’s because you have a very specific workload rather than a default server.
I do know that you guys, I think it was in a recent blog post, talked about moving to having data-free master nodes?
Nick: You know what, we’re still working on that.
Andrew: You are?
Nick: I actually had some issues with that. I should probably ask someone over at elasticsearch what was going on, but when I went to add the data-only nodes and master-only nodes, some of the masters weren’t joining the cluster.
What I ended up doing in that case was just completely over-provisioning ourselves and adding a ton more servers. That’s served [us] pretty well since then. We’re definitely working on some of our cluster provisioning, just the operations side of things. One thing we’re working on, sort of a sneak peek, a pre-announcement, a little scoop for you, will drive my discussion of what’s different about how we do things... actually, let me start with that.
The other main thing that’s different about what we do is this shared cluster concept, where we’re actually doing multi-tenancy within a single elasticsearch cluster.
I guess that’s not exactly a secret; on our site we’ve talked about that in a few blog posts. What I mean is that we emulate the elasticsearch API in front of elasticsearch. We have this proxy layer in front of elasticsearch, built in Java on Netty, and we have handlers that whitelist the different actions within elasticsearch that make up the 80 or 90 percent most common use cases. Then we map your URLs and your identifiers, we enforce authentication, and all that good stuff, all against a single elasticsearch cluster where we map your logical naming onto our internal naming.
That’s an interesting approach that we take to elasticsearch that I think works great in this sort of cloud environment where you’re not running your own servers because you probably don’t really have a huge use case. We can give people some really great economies of scale by using this shared cluster concept.
That said, we are continuing to work on all our operational stuff in order to do dedicated clusters as well, so you have self-serve, full access to the entire cluster API on dedicated clusters.
Andrew: That would be an upgrade option: if you start to outgrow the shared one, you’ll have your own cluster that you can crash if you want to.
Nick: Right, exactly. If you know that you absolutely need all the features inside of elasticsearch, or you want our maximum guarantees in terms of performance, uptime, and availability, you can graduate from the “try elasticsearch, see how it feels” smaller use case up to the more heavy duty dedicated cluster stuff. We’re dogfooding a lot of that, actually, because we’re using it to manage our existing shared clusters. I do think that it’s wise to separate out at least the master-only nodes.
In general we try to run our nodes such that they’re not getting insanely overwhelmed by load. The reason you would separate roles is that otherwise you have a node that’s doing three jobs. It’s storing data and processing search requests and things like that. It’s serving as the HTTP traffic terminator, where it receives your request, plans which nodes to distribute that request out to, and consolidates the results. And it’s also the elected master of the cluster, which is sort of the final authority on cluster state.
If that one node gets overloaded, you’re potentially facing latency when it’s handling your HTTP requests and distributing them to other servers. You can potentially be facing a situation where any operation that modifies the cluster state has to wait for that master to respond; now creating and deleting indexes will start hanging a little bit. Those are the major ones that we see: whenever we get alerts or reports from customers that index creation is failing, we know that probably some node that happens to be the master is a little more loaded than it ought to be.
Those are good things to be aware of when you’re scaling elasticsearch, especially given that elasticsearch does so much work behind the scenes. The one thing they really do give you some control over is which nodes are doing which roles. If you can have a good picture of that, then you’ll be in good shape.
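For reference, the role separation Nick describes is configured per node in elasticsearch.yml; a minimal sketch of the two-setting split (from the elasticsearch configuration of that era):

```yaml
# Master-eligible node: coordinates cluster state, stores no data.
node.master: true
node.data: false

# A data-only node would invert both settings:
# node.master: false
# node.data: true
```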
Andrew: One other question I had for you. If you could go back in time and give yourself... well, I guess maybe you wouldn’t be a good person to ask this, because you’d already been using Solr for a long time at that point. For new elasticsearch developers, what are some little pieces of advice you’d give them that a lot of people are unaware of when they start? Some hard-won knowledge that would be of aid to the new developer.
Nick: You mean in terms of using elasticsearch?
Andrew: In terms of getting started. There are certain things you have to learn the hard way... where that process could be shortened with a little bit of advice up front.
Nick: That’s a great question. I think doing updates in batches is actually a really big one. That’s probably one of the most common ways that people crash their own clusters: they’re serving some reasonable level of production traffic off their search engine, and then, “Okay, well, I’m going to go reindex because we need to add a new field or we changed our analysis somewhere. I’m going to reindex and... oh my goodness, why is our production search crashing?”
That’s something that’s really worth being aware of, and it really does fit that description of hard-won knowledge. We’re fortunately in a position where we can encourage better usage patterns: we actually throttle concurrent connections per type of request. Your search requests get a different bucket than your update requests, and your bulk requests also get their own bucket. We’ll send you errors before we let you overwhelm the shared cluster setup. It can be painful for some folks who would really benefit from highly parallel indexing, but it’s better for everybody in the long run to be really, really conservative about that stuff.
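The batching advice above can be sketched in a few lines of Ruby. This is a hypothetical illustration, not code from Bonsai or Stretcher: `send_bulk` stands in for whatever bulk call your client provides, and the batch size is a deliberately conservative placeholder to tune for your own cluster.

```ruby
# Reindex in modest batches rather than one request per document
# (too chatty) or one giant request (crushes the cluster).
BATCH_SIZE = 500 # illustrative; tune to what your cluster tolerates

def reindex_in_batches(docs, batch_size: BATCH_SIZE)
  docs.each_slice(batch_size).map do |batch|
    # send_bulk(batch)  # in real code: one bulk request per batch
    batch.size
  end
end
```

Pausing briefly between batches is another cheap way to keep a reindex from starving live search traffic.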
Being really good at indexing really comes back to that; if you want to sleep well at night when it comes to your search engine, that’s definitely a good piece of knowledge to have. Then from there, understand a little bit about sharding. If you can have some answer to what your sharding scheme is, even something as arbitrary as “we create a primary shard per million documents,” then you’ll probably be in better shape than someone who creates an index with a single shard and grows it to many, many gigabytes on some EC2 instance, for example. That can be done, depending on the kind of searches you’re doing, but it gets to be a little more undefined.
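The arbitrary-but-useful rule of thumb Nick mentions, one primary shard per million expected documents, is simple enough to write down. A sketch (the function name and ratio are illustrative; the point is having *some* scheme):

```ruby
# One primary shard per million expected documents, floor of one.
# Remember: primary shard count is fixed at index creation time,
# so this estimate has to happen up front.
def primary_shards_for(expected_docs, docs_per_shard: 1_000_000)
  [(expected_docs.to_f / docs_per_shard).ceil, 1].max
end
```

So an index expected to hold 2.5 million documents would get three primary shards, while a small index still gets one.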
Andrew: Of course. And lastly, what kind of features would you like to see in elasticsearch that aren’t there presently?
Nick: Yeah, great question. Some of the stuff I want is, I’m pretty sure, already being worked on. One is a good API for fetching snapshots of your index. This plays into the ability to do cross-region replication: the ability to fetch the data from one region into another. Getting support for that would be really interesting to see.
I should have prepared for that question, because I think I have a really good answer to it... I’m not remembering right now. Yeah, that’s a great question. As far as features... oh, I remember now. I actually saw this in the Lucene bug tracker a while back, and I think it was from an elasticsearch developer: the ability to specify a sort field in advance and pass that to your merge policy, some sort of sorting merge policy.
We see people trying to, for example, index a stream out of Twitter for search and then order by recency by default. The way Lucene generates its segments, it’s biased to give you results more quickly out of your first-written documents.
There’s a lot of bias toward your oldest documents, and in so many use cases, when we’re talking about search engines on websites, people really want the new and fresh stuff. If you have a reasonably complex query and you’re trying to order by recency, you’re really hurting your performance by asking elasticsearch to do that sort for you, because it has to go through and sort all the results it found, on every shard it found them on.
Odds are, you’re going to end up discarding over 90 percent of that when you just ask for that first page back. The ability to actually do that sorting in the merge policy would be really interesting; I think both Solr and elasticsearch would benefit a lot from that.
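For concreteness, this is the kind of query being discussed, expressed as a Ruby hash a client would serialize to JSON. The field names (`text`, `created_at`) are illustrative assumptions, not from the interview:

```ruby
# Recency-sorted search: every shard must collect and sort *all* of
# its matches before the coordinating node merges them, even though
# only `size` hits come back to the caller.
RECENCY_SEARCH = {
  query: { match: { text: "elasticsearch" } },
  sort:  [{ created_at: { order: "desc" } }],
  size:  10
}.freeze
```

With only ten hits requested, the vast majority of that per-shard sorting work is thrown away, which is exactly why a newest-first bias at the segment level would help.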
Andrew: Would that be a case where, given the current architecture of elasticsearch, you would use a temporal index if that were an ongoing problem?
Nick: Kind of, but not really, because this goes down to the segment level. It’s particularly evident when you... I forget what the name of the parameter is, but you can specify the amount of time that elasticsearch is allowed to spend running your search, which is a great feature. The idea there being: if you’re getting a ton of results back, then maybe you searched a pretty broad term, and any result is going to be as good as any other.
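The parameter Nick can’t recall here is most likely the search API’s `timeout`: when the cap is hit, elasticsearch returns whatever hits it has collected so far and flags the response with `timed_out: true`. A hedged sketch, with an illustrative field name and value:

```ruby
# Cap how long the search may run; partial results come back rather
# than an error if the cap is reached.
CAPPED_SEARCH = {
  timeout: "500ms", # illustrative value
  query: { match: { text: "some broad term" } }
}.freeze
```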
That’s the original logic behind capping the amount of time a search spends. I think that goes back to a day and a design from Lucene when recency wasn’t really a priority; there’s just so much more data being generated every minute on typical social web apps these days. Yeah, just the ability to invert that merge sorting order, to give it a per-segment bias toward the newest documents first, I think would be awesome for everybody.
We’ve actually experimented a little bit with that. My business partner, Kyle, the Java and Lucene expert, has written a merge policy that does that. It’s tough, though, to mash up the configuration, to be able to specify the field and do all that in a really user-friendly way.
We haven’t really deployed that on a wide scale outside of a couple of specific known uses. That would be really interesting to me. I think that would be really interesting to a lot of other people.
Andrew: Yeah, that sounds like it would be interesting to me personally as well. That wraps up the questions I had for you. Is there anything else you wanted to add while I’ve got you?
Nick: Well, yeah, good question. I think it’s really interesting that elasticsearch developers, and search engines in general, are having a bit of a renaissance in web development. I’m really gratified to see everyone’s interest in that. It’s really cool to see everyone learning together.
I would just put out a general open call for more contributors, especially in terms of client-side integrations. Elasticsearch especially is relatively young; if you want to be a widely adopted search engine, you need to focus your efforts on having really good vertical, top-down integration in every framework out there that people are developing in.
I don’t know... anyone who’s doing work with elasticsearch, I would encourage them to look under the hood of their client of choice a little bit more. And if one doesn’t exist, write it.
I think the online book that you’re building is really great for that. And I’m certainly willing to have conversations like this with anyone who’s interested.
I’ve been toying with the idea of maybe doing a weekly Google hangout with interested elasticsearch client developers at some point. If there’s any interest out there in something like that, I’d be happy to encourage it, because that’s good for everyone. The more people using elasticsearch, the more mindshare goes into improving it. Ultimately you end up with better tools for developers and better websites for all of our users.
Andrew: As a client developer I would say that I’m a fan of that. You have my full support on that as well.
Nick: Maybe my other question would be: tell me a little bit about Stretcher. How is it going, and what are some of your plans there?
Andrew: For people listening who are not Ruby developers, and for those who are but haven’t heard of the gem: there’s a gem called Stretcher, which I’m the author of. It’s a lightweight Ruby elasticsearch client. The basic idea is that it’s not an integration with a framework like Rails; it’s just a Ruby client for elasticsearch that helps you out in terms of talking to the server. Elasticsearch mostly speaks HTTP, so you could use cURL, like all the examples on the website do, or a plain Ruby HTTP client, which works okay.
But a lot of the APIs have weird data-formatting requirements, especially the bulk APIs, that you probably don’t want to have to redo from scratch. There’s also logic in there for efficient connection use.
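To show what a client hides: the bulk API takes a newline-delimited body, where each index operation is an action/metadata line followed by a source line, and the whole payload ends with a trailing newline. A minimal sketch, with illustrative index and type names (types existed in elasticsearch of this era):

```ruby
require "json"

# Build a newline-delimited bulk body: one action line plus one
# source line per document, terminated by a final newline.
def bulk_index_body(docs, index: "items", type: "item")
  docs.flat_map { |doc|
    [JSON.generate(index: { _index: index, _type: type, _id: doc[:id] }),
     JSON.generate(doc)]
  }.join("\n") << "\n"
end
```

Getting details like the trailing newline wrong produces confusing server errors, which is exactly the kind of thing a client library should handle once, correctly.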
There’s good logging stuff in there as well, which is nice. Stretcher has been going really well; we’ve seen a lot of interest in it. Right now we’re in the middle of reworking our core search API, which works fine today, but there are some improvements to be made. This will perhaps be our first ever... I’m both proud and sad to say this... our first ever breaking API change.
I’ve tried very hard to keep the API fully consistent, but we’ll be releasing 2.0 at some point. It will be a minor change; I think it will be an easy upgrade for almost everybody, but it will help with consistency... mostly consistency things, and making sure you can get access to certain data in certain ways.
It’s going well, and there’s been some good involvement from the community. We’re having an ongoing discussion about the future of the API, so if you’re a user and this goes out before we release it, there’s a discussion in our GitHub issues right now.
Nick: Right on. Yeah, I think the client side of things is so important, because, bottom line, that is what people will spend most of their time interacting with: actually writing code for the thing. It’s a much smaller set of developers and engineers that might be working on scaling the thing, and that’s necessary and good, especially in today’s age of cloud hosting... elasticsearch is a breath of fresh air in terms of running it yourself. But the whole client experience is where a lot of that effort needs to go.
Andrew: I’ll add something to the client discussion, which is one of my tips for people starting with elasticsearch that I wish I had known: I jumped into the client too early. You’re right that it is the primary experience, because that’s ultimately the end goal of all your code.
But when learning elasticsearch and trying out new queries, I actually don’t use our Ruby client. There are a number of front ends for elasticsearch that are mostly browser based. I’m the author of one called Elastic Hammer, but if you go to the front ends page on elasticsearch.com there’s a bunch of others. I highly recommend experimenting with one of these when you’re building queries, because the difference in feedback loop between trying to integrate a whole new search query into your application and playing around in a REPL is huge.
One thing I didn’t realize early on is how experimental search queries are. They’re not like a database query, where you’re right or you’re wrong and maybe you tweak it if you have a bad query plan. You spend a lot of time thinking, especially when you haven’t written very many of them, “I know how I’ll solve this”; then you implement it, spend a lot of time, get the results back, and they suck.
Just approaching queries as throwaway code until you’re pretty sure they’re going to work is a really, really important thing, I’ve found. These days I try to stay as far away from the actual implementation as possible until I’m pretty sure they’re going to work.
Nick: Good advice.
Andrew: I guess that... actually, I’ve got to run in a bit, Nick, but thanks for joining me today!