Spotlight on: Elasticsearch
We talk to Shay Banon, creator and co-founder of the distributed RESTful search engine.
In the June issue of JAX Magazine, we talked to the creator and co-founder of the distributed RESTful search engine making enterprise waves.
JAX Magazine: Can you give some background to the company and how it came about?
Shay Banon: Elasticsearch was open sourced about three years ago, and I quit my day job to do it full-time. The project was getting more and more successful, gaining broad adoption and being used as a mission-critical component in applications. Companies using it were actively looking for professional support around it.
I already knew Uri Boness and Simon Willnauer through my history around Lucene and search in general, and Uri connected me with Steven Schuurman, one of SpringSource’s co-founders. I flew over to Amsterdam and we all hit it off, and decided together to form the Elasticsearch company as it is today.
JAXmag: Apache Lucene – what is so good about that project in your view and what is possible to achieve with it?
Banon: Apache Lucene is a wonderful Information Retrieval library. It has some of the best minds behind it, and is the de facto standard when it comes to low-level IR work, purely on merit. It is a library though. In order to use it, one has to program in Java (or on the JVM), effectively embedding Lucene, and must know its internal API and understand its design intimately to make the best use of it.
JAXmag: Why is it the foundation of Elasticsearch?
Banon: [It] forms a strong foundation for Elasticsearch to build on. In Elasticsearch, we managed to utilize Lucene for what it’s best for, making sure it’s used as it should be, and extended it in order to achieve some of the broader goals of Elasticsearch as a distributed search and analytics engine. We also managed to strike a nice balance for people who are already familiar with Lucene, yet prefer to use it through an “over the wire” API where Elasticsearch takes care of the distributed API execution, data distribution, and so on. For example, our search API provides an easy-to-use Query DSL that maps very nicely to how queries in Lucene are actually represented.
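As a rough illustration of the Query DSL mentioned above, the sketch below builds a search request body as a plain Python dictionary. The `bool` query with its nested clauses mirrors the way Lucene composes a boolean query out of smaller queries. The field names `title` and `tag` and the helper `build_search_body` are hypothetical and exist only for this example.

```python
import json


def build_search_body(text, tag, size=10):
    """Build a Query DSL request body as a plain dict.

    The field names "title" and "tag" are made up for this sketch.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text clause: the text is analyzed before matching.
                "must": [{"match": {"title": text}}],
                # Exact-value clause: the term is matched as-is.
                "filter": [{"term": {"tag": tag}}],
            }
        },
    }


body = build_search_body("distributed search", "lucene")
print(json.dumps(body, indent=2))
```

The resulting JSON is what a client would send over HTTP to a search endpoint; no Java or Lucene API knowledge is needed to express the query.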
We did extend Lucene, for example, in adding additional geo query capabilities, and adding a stronger consistency and atomicity model for the data stored in Elasticsearch. Other features we have built from the ground up, like our analytics/aggregation engine providing real-time aggregations over billions of documents.
Another example is the fact that we take care of distributing the data and the search requests automatically, potentially across tens or hundreds of machines. This includes other important aspects that are outside of the context of a library and more towards a distributed runtime environment. For example, we have a sophisticated evented networking layer that allows you to execute a distributed search request across hundreds of machines, and receive the responses in milliseconds.
JAXmag: Can you outline some key goals from a design perspective with Elasticsearch – how is it distributed/highly available, for example?
Banon: Building a distributed system is not easy. Careful thought needs to be given to all the elements that form it, starting from how networking is done for cross-node communication and how nodes discover each other, and ending with how data is balanced across all available machines.
We do all the “regular” things one would expect from a high-end distributed system. We allow the user to easily partition the data, and make sure that the data has multiple replicas or copies for high availability. Elasticsearch stands on the more proactive side of distributed systems, taking automatic action in moving data around to make use of more machines if they are added to the cluster, or reallocating data in case of machine failure.
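The partitioning and replication described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Elasticsearch's actual routing or allocation logic: the CRC32 hash, the round-robin placement, and the function names `shard_for` and `place_copies` are all stand-ins chosen for illustration.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Map a document ID to a shard with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_copies(shard, nodes, replicas):
    """Put the primary and its replicas on distinct nodes, round-robin.

    Returns a list whose first entry holds the primary copy.
    """
    if replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to hold every copy on a separate machine")
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas + 1)]


nodes = ["node-a", "node-b", "node-c"]
shard = shard_for("user-42", num_primary_shards=5)
print(shard, place_copies(shard, nodes, replicas=1))
```

Because every copy of a shard lands on a different node, losing one machine still leaves a copy available, and adding a node to the list changes where copies are placed.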
One thing that we take deeply to heart is the notion that not all data is the same, and we strive hard in Elasticsearch to allow the user to easily convey that to our system. For example, in a logging scenario, where Elasticsearch is being used more and more, old logs are probably not as important as more recent logs. New logs keep on being generated, and take the lion’s share of the searches. With Elasticsearch, it’s easy to build a system that takes that into account, and prioritizes resources for recent or new data over older data.
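One common way to treat recent logs differently from old ones is to write each day's logs to its own index and search only the newest indices by default. The sketch below assumes that pattern; the `logs-YYYY.MM.DD` naming scheme and the helper names are hypothetical.

```python
from datetime import date, timedelta


def index_name(day, prefix="logs"):
    """Name of the index that holds one day's log entries."""
    return f"{prefix}-{day:%Y.%m.%d}"


def recent_indices(today, days, prefix="logs"):
    """Indices covering the last `days` days, newest first."""
    return [index_name(today - timedelta(days=i), prefix) for i in range(days)]


print(recent_indices(date(2013, 6, 1), 3))
# ['logs-2013.06.01', 'logs-2013.05.31', 'logs-2013.05.30']
```

With data split this way, the newest indices can be given more resources while older ones are moved to cheaper machines or dropped, without touching the documents themselves.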
JAXmag: Why JSON over HTTP?
Banon: JSON has become the de facto standard to represent data in the past few years. And HTTP has become the de facto transport for it. By standardizing on JSON and HTTP, we make sure that Elasticsearch can be easily integrated into any environment, regardless of the programming language, framework, or technology stack used.
I would add that how a system uses JSON is as important as the fact that it uses JSON. As with any data format, JSON can be heavily abused, and we take extra care in designing our API to be as easy to use and consume as possible. For example, our histogram aggregation component returns the data in a format that can easily be used to drive almost any charting library out there.
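To show why a bucket-oriented response is easy to chart, the sketch below turns a histogram-style result into the two lists most charting libraries expect. The sample response is invented for this example and modelled on a bucket list with a key and a document count per bucket; the exact layout of a real response depends on the Elasticsearch version, and the aggregation name `prices` is hypothetical.

```python
# Invented sample data, shaped like a bucketed histogram result.
sample_response = {
    "aggregations": {
        "prices": {
            "buckets": [
                {"key": 0, "doc_count": 3},
                {"key": 50, "doc_count": 7},
                {"key": 100, "doc_count": 2},
            ]
        }
    }
}


def to_series(response, agg_name):
    """Split histogram buckets into x values (keys) and y values (counts)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    xs = [bucket["key"] for bucket in buckets]
    ys = [bucket["doc_count"] for bucket in buckets]
    return xs, ys


xs, ys = to_series(sample_response, "prices")
print(xs, ys)
```

No reshaping beyond two list comprehensions is needed before handing the data to a bar-chart function.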
The same applies to HTTP, by the way. It’s not enough to just say we support HTTP; we take HTTP to heart. For example, when a document is indexed in Elasticsearch and there is a conflict due to optimistic concurrency control logic, we return the HTTP status code 409 (CONFLICT).
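The optimistic concurrency behaviour described above can be modelled with a small in-memory store. This is a simulation written for this article, not Elasticsearch code: the class `VersionedStore` and the parameter `expected_version` are hypothetical names, and the only detail taken from the interview is that a version conflict maps to HTTP 409.

```python
from http import HTTPStatus


class VersionedStore:
    """Toy document store that rejects writes based on a stale version."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        """Store a document and return (HTTP status, current version)."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            # Someone else updated the document first: report a conflict.
            return HTTPStatus.CONFLICT, current_version
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        status = HTTPStatus.CREATED if current_version == 0 else HTTPStatus.OK
        return status, new_version


store = VersionedStore()
print(store.index("1", {"title": "first"}))
print(store.index("1", {"title": "second"}, expected_version=1))
print(store.index("1", {"title": "stale"}, expected_version=1))
```

The third call carries a version that is no longer current, so it is rejected with 409 and the client can re-read the document and retry.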
JAXmag: How important is the schemaless approach?
Banon: Elasticsearch is semi-schemaless, if that’s a definition. The dataset a user pushes into Elasticsearch ends up auto-defining the schema that will be used. Though that dynamic definition is possible, users have all the power to explicitly define a schema for their JSON documents.
We feel it’s important because it helps users get started with Elasticsearch, simply getting data, formatting it in JSON (that has some notion of types), and storing it in Elasticsearch. It also serves an important aspect in real systems, where data ingested might be undefined, and Elasticsearch allows for an evolvable schema over time.
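A minimal sketch of how a schema can be derived from the documents themselves, and grow as new fields appear, is shown below. The type names and the rule that the first type seen for a field wins are simplifying assumptions for illustration, not Elasticsearch's actual mapping rules; `infer_type` and `infer_schema` are hypothetical helpers.

```python
def infer_type(value):
    """Guess a field type from a JSON value (illustrative type names)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def infer_schema(doc, schema=None):
    """Return a copy of `schema` extended with any fields that are new in `doc`."""
    schema = dict(schema or {})
    for field, value in doc.items():
        # Existing fields keep the type they were first seen with.
        schema.setdefault(field, infer_type(value))
    return schema


schema = infer_schema({"user": "kimchy", "age": 30, "active": True})
schema = infer_schema({"age": 31, "score": 1.5}, schema)
print(schema)
```

The second document adds a `score` field without any upfront definition, which is the evolvable part; an explicit schema would simply be a dictionary supplied before the first document arrives.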
JAXmag: What makes Elasticsearch different to Solr?
Banon: To be honest, I know that a lot of people like to compare Elasticsearch to Solr, but I personally think that the systems are quite different in the breadth of problems they try to solve. Solr was born in the single-server, Enterprise Search era, and has proved to be a great solution for that. Elasticsearch’s inception and implementation started with a deep understanding of the fact that systems today require advanced distributed capabilities, yet must be easy and digestible to use with today’s advanced technology stacks. REST is a good example. Also, we at Elasticsearch have a broad vision for it, a prime example being our real-time analytics capabilities, which took a lot from my personal experience of building distributed in-memory data grids.
JAXmag: You have some very impressive clients using the project such as Foursquare, SoundCloud and GitHub, all of whom have varying use cases. Does that show the malleability of Elasticsearch?
Banon: Definitely. The breadth of use cases Elasticsearch helps solve is quite amazing. Elasticsearch combines a unique holy triangle of data exploration capabilities: unstructured search, structured search, and aggregations or analytics.
By combining those three areas in a single product, users find themselves empowered with what they can do with their data.
For that reason, the data content itself, though interesting, becomes irrelevant to a degree. If implemented properly, there isn’t a big difference between a location on Foursquare, music on SoundCloud, code/issues on GitHub, trades in a bank, audit logs, for example, in financial institutions, and so on. And we see all of those use cases and more use Elasticsearch to help make sense of their data.
JAXmag: Moving forward – what’s planned for the company?
Banon: We are focused on continuing to improve the product itself, and not just the “core” Elasticsearch, but also Lucene, and the ecosystem that has developed around Elasticsearch. Language clients are a great example [of that]. We have a lot of work left around improving other aspects of the product, like our documentation, which we are actively working on.
Aside from the product, we will continue to invest heavily in providing our different services around Elasticsearch, namely our production support, development support and training. We are also starting to take a more active role in conferences and meetups, to help educate people about Elasticsearch. Though, I must admit, it is starting to be challenging to find people who haven’t heard of it.