Beyond batch

Cloudera unveil real-time query engine Impala

Right in the midst of conference season, it’s no surprise to see the Hadoop vendors showcasing their wares now that the big data ‘kernel’ has permeated the enterprise world.

Yet there’s no beating around the bush here: there’s still an array of concerns for companies considering adopting the data-munching technology for their large-scale data needs. The most pressing is, of course, real-time querying and processing.

Plenty of work at Apache Hadoop HQ has gone into adding extra functionality to the batch-oriented platform, through projects like Hive and HBase, but we’re still a way off the speed and business intelligence capabilities that enterprises crave.

This week’s Hadoop World 2012 could prove to be the most important battleground yet for the likes of Cloudera, Hortonworks, MapR and Greenplum, as they attempt to assure prospective customers that they should stick with them for the coming months, and (most importantly) give them a glimpse into the future.

After two years of closely guarded development, Cloudera yesterday unveiled Impala, a speedy distributed parallel SQL query engine. Tailored to query in real time against the standard Hadoop Distributed File System (HDFS) and/or the NoSQL choice HBase, it could prove to be a massive leap for the enterprise, being the first concrete effort to emerge offering real-time capability.

Cloudera assure us on the project page that it isn’t intended to displace the batch processing frameworks of old, but to serve as an additional tool in the Hadoop arsenal. After all, one of the reasons Hadoop has risen to prominence is its vibrant ecosystem of projects, which has served to consistently push the envelope.

The ties to Google’s F1 system, revealed back in May this year, are well documented. In fact one of the key people behind its query engine, Marcel Kornacker, was hired by Cloudera to lead this effort. Chief Architect at Cloudera, Doug Cutting, recently told JAXenter of Google “being ahead of the curve” and giving us all “a roadmap of where we could go.”

Like so many projects in the space, Impala has been open sourced under an Apache license, inviting the Hadoop community and other companies to help push it along. Initially offered in a private beta (and reportedly piloted by a number of companies such as Expedia), Impala is now available to the public. Cloudera have dubbed this the 1.0 release, but are keen to stress that this is only the beginning for them. Impala should be rolled out on its own in 2013, when Cloudera’s CDH5 appears.

Interestingly, competitor MapR have also put their real-time chips down, leading another Apache project, Drill. Whilst the two have notable differences (Drill is based on Google’s interactive analytic system Dremel), they both reflect the desire for real-time querying in Hadoop. You’d likely assume the other big Hadoop vendor, Hortonworks, would want to dip their toes in soon as well. As in most industries, increased choice can only be a good thing.

For all Hadoop’s successes, there needed to be a push beyond batch. Now we’ve got the signal from the vendors that they’re ready to answer the real-time call.

Image courtesy of Ludovic Hirlimann

Chris Mayer
