Beyond batch

Cloudera unveil real-time query engine Impala

Chris Mayer
Impala - Aepyceros melampus

The Hadoop vendor are the first to really capitalise on real-time with this new open source querying project. Will it pave the way for the rest to follow?

Right in the midst of conference season, it’s no surprise to see
the Hadoop vendors showcasing their wares now that the big data
‘kernel’ has permeated the enterprise world.

Yet there’s no beating around the bush here – there’s still an
array of concerns for companies considering adopting the
data-munching technology for their large-scale data needs. The most
pressing is of course real-time querying and processing.

Plenty of work at Apache Hadoop HQ has gone into adding extra
functionality to the batch-oriented platform, through projects like
Hive and HBase, but we’re still a way off the speed and business
intelligence capabilities that enterprises crave.

This week’s Hadoop World 2012 could prove to be the most important
battleground yet for the likes of Cloudera, Hortonworks, MapR and
Greenplum, as they attempt to assure prospective customers that
they should stick with them for the coming months, and (most
importantly) give them a glimpse into the future.

After two years of closely guarded development, Cloudera yesterday
unveiled Impala, a speedy distributed parallel SQL querying engine.
Tailored to query in real-time against the standard Hadoop
Distributed File System (HDFS), and/or the NoSQL choice HBase, it could
prove to be a massive leap for the enterprise, being the first
concrete effort to emerge offering real-time capability.

Cloudera assure us on the project page
that it isn’t intended to displace the batch processing frameworks
of old, but to serve as an additional tool in the Hadoop arsenal. After all,
one of the reasons Hadoop has risen to prominence is through its
vibrant ecosystem of projects, which has served to consistently
push the envelope.

The ties to Google’s F1 system, revealed back in May this year, are
well documented. In fact one of the key people behind its query engine,
Marcel Kornacker, was hired by Cloudera to lead this effort. Chief
Architect at Cloudera, Doug Cutting, recently told JAXenter of Google
“being ahead of the curve” and giving us all “a roadmap of where we
could go.”

Like so many projects in the space, Impala has been open sourced under
an Apache license, inviting the Hadoop community and other companies to
help push it along. Initially being offered in a private beta (and
reportedly piloted by a number of companies such as Expedia),
Impala has now been made available to the public. Cloudera have dubbed
this the 1.0 release, but are keen to stress that this is purely the
beginning of things for them. Impala should be rolled out on its
own in 2013, when Cloudera’s CDH5 appears.

Interestingly, competitor MapR have also put their real-time chips
down, leading another Apache project, Drill. Whilst the two have
notable differences (Drill is based on Google’s interactive analytic
system Dremel), they both reflect the desire for real-time querying in
Hadoop. You’d likely assume the other big Hadoop vendor Hortonworks
would want to dip their toes in soon as well. As in most
industries, increased choice can only be a good thing.

For all Hadoop’s successes, there needed to be a push beyond batch.
Now we’ve got the signal from the vendors that they’re ready to
answer the real-time call.

Image courtesy of Ludovic Hirlimann
