Shay Banon interview

A brief history of Elasticsearch

Diana Kupfer
Shay Banon, founder of Elasticsearch

The Elasticsearch founder shares his experiences with Lucene and tells us how it all started with a simple cooking app for his wife.

From GitHub to The Guardian, numerous major enterprises have turned to Elasticsearch to help make sense of user interaction data and improve their search results. Diana Kupfer of the JAX team caught up with the search engine’s founder, Shay Banon.

JAXenter: You started Compass, your first Lucene-based technology, in 2004. Do you remember how and why you became interested in Lucene in the first place?

Shay Banon: Reminiscing on Compass’s birth always puts a smile on my face. Compass, and my involvement with Lucene, started by chance. At the time, I was a newlywed who had just moved to London to support my wife with her dream of becoming a chef. I was unemployed and desperately in need of a job, so I decided to play around with “new age” technologies in order to bring my skills up to date. Playing around with new technologies only works when you are actually trying to build something, so I decided to build an app that my wife could use to capture all the cooking knowledge she was gathering during her chef lessons.

I picked many different technologies for this cooking app, but at the core of it, in my mind, was a single search box where the cooking knowledge experience would start: a single box where typing a concept, a thought, or an ingredient would start the path towards exploring what was possible.

This quickly led me to Lucene, which was the de facto search library available for Java at the time. I got immersed in it, and Compass was born out of the effort of trying to simplify using Lucene in your typical Java applications. Conceptually, it simply started as a “Hibernate” (the Java ORM library) for Lucene.

I got completely hooked on the project and was working on it more than on the cooking app itself, up to a point where it was taking most of my time. I decided to open source it a few months afterwards, and it immediately took off. Compass basically allowed users to easily map their domain model (the code that models app/business concepts in a typical program) to Lucene, easily index it, and then easily search it. That freedom caused people to start to use Compass, and Lucene, in situations that were wonderfully unexpected. Imagine already having the model of a Trade in your financial app: one could easily index that Trade into Lucene using Compass, and then search for it. The freedom of searching across any aspect of a Trade allowed users to pass this freedom on to their own users, which proved to be an extremely powerful concept.

Effectively, this put me in the front seat, talking and working with actual users who were discovering, as was I, the amazing power that search can have when it comes to delivering business value to their users. Oh, and by the way, my wife is still waiting for that cooking app.

Now, 10 years later, it is the basis of Elasticsearch.
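To make the Trade example above concrete, here is a minimal sketch of the raw Lucene calls that a mapping layer like Compass automates: indexing a hypothetical Trade and then searching across it with a free-text query. The Trade fields are invented for illustration, and the exact constructors vary between Lucene versions (this follows the Lucene 5+ style API).

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TradeSearchSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // In-memory directory for the sketch; newer Lucene versions replace
        // RAMDirectory with ByteBuffersDirectory.
        Directory dir = new RAMDirectory();

        // Index a hypothetical Trade. A mapping layer like Compass generates a
        // Document like this from the annotated domain object automatically.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("id", "T-1001", Field.Store.YES));
            doc.add(new TextField("counterparty", "ACME Bank London", Field.Store.YES));
            doc.add(new TextField("instrument", "EUR/USD forward", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search across any aspect of the trade with a free-text query.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("instrument", analyzer);
            ScoreDoc[] hits = searcher.search(parser.parse("forward"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println("matched trade: " + searcher.doc(hit.doc).get("id"));
            }
        }
    }
}
```

Compass’s value, as described above, was that users mapped their domain model once and never had to write this plumbing by hand.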

What do you appreciate most about the Lucene project and the technology itself?

I will start first and foremost with the people who are involved with the Apache Lucene project. Though I was probably around earlier than most current Apache Lucene developers, I have seen people come along and invest in Lucene to bring it to a level that I never imagined it would get to. I spent countless hours talking to people like Mike McCandless and Simon Willnauer about many different aspects of Lucene during both the Compass and early Elasticsearch days.

When we started Elasticsearch, I wanted to make sure that we built a company that would sustain this rapid development pace and level of excellence that goes into Apache Lucene. Right from the start, the thing that got me excited was that we were already forming it with people who are experts in Lucene, like Simon, Uri Boness, and Martijn Van Groningen.

Even so, if you asked me back then, I would never have imagined that today we would be able to attract such a strong team focused on Apache Lucene. It’s humbling to see people like Mike, Robert Muir, Mark Harwood, Adrien Grand, and newish Lucene committers such as Areek Zillur and Ryan Ernst join our company, and help push Lucene forward to an even better and stronger state than it is today.

“Pushing” and “stretching” Lucene

If you look at Elasticsearch, it’s wonderful to see the areas in which Elasticsearch “stretches” Apache Lucene. For example, the need to push Apache Lucene to be more resilient, an unexpected (from Lucene’s perspective) emphasis on indexing speed, or the visibility into Lucene in terms of instrumentation.

There isn’t a day that goes by where I don’t hear one of the Lucene developers saying “I never thought Lucene would be used to solve this use case,” or be used in such a way. To me, this is wonderful: we are creating an environment where a creative nexus forms around individuals using Elasticsearch as a platform, which pushes Apache Lucene forward, and at the same time we, as a company, have the ability to give one of the most talented groups of developers I have ever worked with the opportunity to make it happen.

As for the technology itself, I think Apache Lucene is, and continues to be, one of the best technology innovations out there. It’s enough to mention things like automatons, finite state transducers, doc values, randomized testing, and the list goes on. It is also a testament to the project that Apache Lucene is, I believe, one of the only projects out there that consistently finds bugs in Java or the JVM, to the point where Oracle seeks out validation of a new Java version from the Apache Lucene project.

Having all that experience in building distributed systems, what would you say are some best practices and areas that require special attention?

I will start with the fact that building a distributed system is not easy, and there is a breadth of scope that goes with the term “distributed system,” so not all of this applies to all projects. One of the wonderful things that happened once Elasticsearch stopped being a one-man project and we formed a company around it is that we could invest so much more into the development of Elasticsearch itself.

Simon Willnauer, one of the company founders, took Elasticsearch and the team and started to push them to a whole new level. One of his first missions was to create a whole new testing infrastructure for it (inspired by the Apache Lucene testing infrastructure, including randomized testing, which is a wonderful topic, by the way).

A new testing infrastructure

With the whole new level of effort that went into testing, I would say that one of the most important aspects of building distributed systems is the ability to test and validate their behavior. Conceptually, when someone thinks about distributed systems, it means different processes, on different machines, running over a network. The immediate thought that follows, when it comes to testing, is that this is how the system should be tested. That approach creates a very complicated testing harness that takes a long time to run and is very hard to debug when a failure happens.

“Freaky fast” data searches – via elasticsearch.com

At Elasticsearch, we’ve invested a lot of time building a test harness that can run our distributed tests as part of the “regular” integration tests. This means that with every test run of Elasticsearch, a full cluster can be started and operated on within a single JVM. This includes simple things like rolling restarts while indexing data, but also more interesting tests like simulating network disconnections, all the way to simulating long GC pauses, all of which are critical for validating a distributed system’s behavior.

Being able to rely on such infrastructure, and to run those tests simply and quickly, allows us to build a more resilient system and expose “dark corners” and edge cases more easily.

Obviously there is a lot more to building a distributed system, but as Simon says, “If it doesn’t have a test, how can you tell it works?” This is one of the most fundamental parts of building one.
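Elasticsearch’s own test framework is not reproduced here, but the reproducibility idea behind the randomized testing mentioned above can be sketched with a small, self-contained JUnit and Lucene test: every random decision is derived from a single seed that gets printed, so a failing run can be replayed exactly. The document contents and the occasional forced commits are purely illustrative.

```java
import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import static org.junit.Assert.assertEquals;

public class RandomizedIndexingTest {

    @Test
    public void allRandomDocumentsAreSearchable() throws Exception {
        // One seed drives every random decision; printing it makes a failing
        // run reproducible by hard-coding the same seed on the next run.
        long seed = System.nanoTime();
        System.out.println("test seed: " + seed);
        Random random = new Random(seed);

        int docCount = 1 + random.nextInt(500);
        try (Directory dir = new RAMDirectory()) {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (int i = 0; i < docCount; i++) {
                    Document doc = new Document();
                    doc.add(new TextField("body", randomWords(random), Field.Store.NO));
                    writer.addDocument(doc);
                    // Randomly force intermediate commits so each run exercises
                    // a slightly different indexing path.
                    if (random.nextInt(10) == 0) {
                        writer.commit();
                    }
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                assertEquals(docCount, reader.numDocs());
            }
        }
    }

    private static String randomWords(Random random) {
        StringBuilder sb = new StringBuilder();
        int words = 1 + random.nextInt(10);
        for (int i = 0; i < words; i++) {
            sb.append((char) ('a' + random.nextInt(26))).append("word").append(random.nextInt(100)).append(' ');
        }
        return sb.toString();
    }
}
```

The real infrastructure goes much further than this toy, of course; the point here is only the seed-driven reproducibility that makes randomized failures debuggable.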

What are the core criteria that a search server like Elasticsearch has to meet in order to keep up with today’s business demands?

It all comes down to what Elasticsearch tries to achieve, and at its core, it’s a very simple yet ambitious goal. Businesses today are drowning in data that they would love to be able to make sense of and extract actionable insights from. They would like to do so in the simplest manner possible, in a way that gives them the biggest value possible.

It turns out that search is a wonderful way to do so. When I say search, I mean it in the broader sense of the word, where individuals, just like with the cooking app I personally started with, are looking to gain insights and knowledge from data. When we say search at Elasticsearch, we mean the combination of free text search, structured search, and analytics, in their purest forms, regardless of the amount of data, all in real time.
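As a hedged illustration of that combination, the sketch below sends a single request that mixes a free-text match, a structured range filter and a terms aggregation, using plain HttpURLConnection from the JDK. The index name “logs”, the field names and the local endpoint are hypothetical, and the query DSL is written in the form used by Elasticsearch 2.x and later (older releases used a filtered query instead of bool/filter).

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SearchAndAggregateSketch {
    public static void main(String[] args) throws Exception {
        // Free-text match, a structured range filter and a terms aggregation
        // in one request; "logs" and its fields are hypothetical examples.
        String body = "{"
                + "\"query\": {\"bool\": {"
                + "  \"must\":   {\"match\": {\"message\": \"connection timeout\"}},"
                + "  \"filter\": {\"range\": {\"@timestamp\": {\"gte\": \"now-1h\"}}}"
                + "}},"
                + "\"aggs\": {\"by_host\": {\"terms\": {\"field\": \"host\"}}}"
                + "}";

        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/logs/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            // Print the raw JSON response: hits for the query plus the
            // per-host aggregation buckets.
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```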

Technology inspired by Minority Report

Do you remember the scene from Minority Report where Tom Cruise’s character interacts with the data he explores? Continuously learning and shaping it into what he is after, zooming in and out without any limitations, in real time? This is what we try to enable our users to do with Elasticsearch (though without the fancy hand-waving interaction, at least not yet, but hey, Google started as “just” a search company).

It’s a hefty and ambitious goal, yet I deeply believe in our ability to execute on it. The amazing thing is that our users keep on encouraging us and validating that we are on the right path. Users tell us, daily, that what they manage to achieve with our products is something they never would have thought possible. They keep on finding innovative ways to use Elasticsearch, and I think this is one of the hallmarks of a great product: one that allows users to reach a level of creativity that they initially never even imagined.

Elasticsearch is used by a wide range of organizations, including Foursquare, Wikimedia, GitHub and CERN. In an interview with JAXenter.com last year you said that “the data content itself, though interesting, becomes irrelevant to a degree”. Is this essentially the key to Elasticsearch’s success?

It’s definitely a core aspect of it. One of the first hurdles I had with Elasticsearch was getting users to realize what I had already seen, on a much smaller scale user-base-wise, with Compass, my previous open source project. Back then, when you said “search” to users, they would not immediately grasp all the possibilities it opens up.

I didn’t even grasp the scope of it 10 years ago. But building a technology that allowed users to map any domain model to “search” (namely, Lucene) made crossing this mental barrier much simpler, and suddenly an explosion of use cases started to happen where Compass and Lucene were used to power “not your typical search use case.”

With Elasticsearch, by standardizing on JSON and a RESTful interface, I saw users realizing the same thing, but in a much broader user base: developers of all varieties, across different programming languages, different frameworks, and many, many different use cases. In the end, it all boils down to the power of search, and technically, the power of Elasticsearch and Apache Lucene. Implementation-wise, it really doesn’t matter if the data is your typical web page or Word document, or, to a degree, a location on Foursquare, a trade in a bank, a web server log, or a metric of sorts. Effectively, all of it is a combination of structured and unstructured data that people want to explore and search through, having that Minority Report type of experience regardless of the shape or volume of the data.
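The point that the shape of the data hardly matters can be illustrated with a small, hedged sketch: three very different hypothetical documents (a trade, a check-in with a location, and a web server log line) are indexed through exactly the same JSON-over-HTTP call. The index names, fields, local endpoint and “_doc” URL layout (which follows newer Elasticsearch versions) are all assumptions for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IndexAnythingSketch {

    public static void main(String[] args) throws Exception {
        // Very different domain objects, one and the same indexing API.
        index("http://localhost:9200/trades/_doc/1",
              "{\"counterparty\": \"ACME Bank\", \"notional\": 5000000, \"currency\": \"EUR\"}");
        index("http://localhost:9200/checkins/_doc/1",
              "{\"user\": \"diana\", \"location\": {\"lat\": 48.13, \"lon\": 11.58}}");
        index("http://localhost:9200/weblogs/_doc/1",
              "{\"status\": 500, \"path\": \"/search\", \"message\": \"upstream timed out\"}");
    }

    private static void index(String url, String json) throws Exception {
        // PUT the JSON document to the given (hypothetical) index URL.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(url + " -> HTTP " + conn.getResponseCode());
    }
}
```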

This summer, Elasticsearch raised 70 million dollars in a Series C funding round. What impact does this have on the company’s roadmap?

As you can see, we have very ambitious goals as a company. I will first mention that, as a company, one of our highest values is to deliver tangible value to our users as quickly as possible. For that, we didn’t stop with “just” Elasticsearch. The first thing we did when we formed the company was to make sure there are official client libraries across all popular languages and frameworks out there. Today, we have clients and integrations for Ruby, Python, PHP, Perl, .NET and Java, with more to come. We wanted to make sure that the great experience users have with Elasticsearch extends all the way to their development platform of choice, in the simplest form possible.

We didn’t stop there, and we also had the great fortune of having Rashid Khan join us to invest time in building Kibana, which is a wonderful way for people to easily visualize data that resides in Elasticsearch, exposing all its powerful aspects in real time. The same thing happened with Logstash and its creator, Jordan Sissel. We saw people using Logstash a lot to ingest data into Elasticsearch (as well as other systems), and wanted to make sure we gave Jordan the platform to focus on Logstash and build a strong team around it.

Elasticsearch and Hadoop

We also saw people using Elasticsearch in conjunction with Hadoop to give Hadoop the extra real-time boost it is sorely missing. Costin Leau joined us and helped build one of the best Hadoop integration modules out there with Elasticsearch, supporting all Hadoop distributions, including vanilla Apache Hadoop, MapR, Cloudera, and Hortonworks. As you can imagine, those are big projects, and a lot of investment goes into developing all the different products we have.

This is our main goal product-wise: to make sure we build a strong foundation of products that keep on delivering on the core values that started with Elasticsearch. The company doesn’t stop there, though. Even though most of the people at the company are developers, we see such an amazing uptake from customers that we need to make sure we build a business that is there to deliver everything our customers ask for.

By taking the investment, we can make sure that our company can execute not only on short-term goals, but also take on ambitious long-term plans and start executing on them sooner. This makes me super excited, as the breadth of what we are trying to achieve is both humbling and, somehow, possible.

This interview with Shay Banon originally appeared in Java Magazin.

Author
Diana Kupfer
Working at S&S Media since 2011, Diana Kupfer is an editor at Eclipse Magazine, Java Magazin and JAXenter.de.
