Unblocking that MapReduce pipeline

One To Watch: Apache Crunch

Chris Mayer

Not a breakfast cereal, but yet another big data project straight out of the ASF.

Over the past few years, the Apache Software Foundation
has become the hub for big data-focused projects. An array of
companies have recognised the worth of housing their latest
innovative projects at the ASF, with Apache Hadoop and Apache
Cassandra two shining examples.

Among the many projects arriving in the Apache Incubator was Apache Crunch, a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).
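
To give a flavour of the API, here is a minimal sketch of a word-count pipeline written against the incubating org.apache.crunch packages. The class and method names reflect our reading of the current documentation, so treat the details as indicative rather than definitive:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // A pipeline backed by MapReduce; Crunch plans and submits the jobs for us.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into individual words.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() groups identical words and tallies them; the shuffle and
    // reduce phase is inserted by the library, not written by hand.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}

The shuffle, sort and reduce plumbing never appears in user code; Crunch plans the underlying MapReduce work from the chain of transformations.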

Like so many similar projects before it, Crunch was inspired by work at Google, drawing on concepts from the company's 2010 FlumeJava paper, which explores the idea of data-parallel pipelines. Crunch intends to make managing multi-stage pipelines far easier for those who need it most: the enterprises now picking up Hadoop.
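
That multi-stage emphasis is where Crunch's planner earns its keep: each stage is expressed as another transformation on a PCollection or PTable, and the library works out how the chain maps onto dependent MapReduce jobs. As a purely hypothetical illustration (the input path, field layout and class names below are invented, and exact signatures may vary between releases), a two-shuffle log-analysis pipeline might be sketched like this:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class LogAnalysis {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(LogAnalysis.class);
    PCollection<String> logLines = pipeline.readTextFile("/logs/raw");

    // Stage 1: parse each tab-separated line into a (userId, bytes) pair.
    PTable<String, Long> bytesByUser = logLines.parallelDo(
        new DoFn<String, Pair<String, Long>>() {
          @Override
          public void process(String line, Emitter<Pair<String, Long>> emitter) {
            String[] fields = line.split("\t");
            emitter.emit(Pair.of(fields[0], Long.valueOf(fields[1])));
          }
        }, Writables.tableOf(Writables.strings(), Writables.longs()));

    // Stage 2: total bytes per user (the first shuffle).
    PTable<String, Long> totals =
        bytesByUser.groupByKey().combineValues(Aggregators.SUM_LONGS());

    // Stage 3: a histogram of those totals (a second, dependent shuffle).
    PTable<Long, Long> histogram = totals.parallelDo(
        new DoFn<Pair<String, Long>, Pair<Long, Long>>() {
          @Override
          public void process(Pair<String, Long> userTotal,
                              Emitter<Pair<Long, Long>> emitter) {
            emitter.emit(Pair.of(userTotal.second(), 1L));
          }
        }, Writables.tableOf(Writables.longs(), Writables.longs()))
        .groupByKey()
        .combineValues(Aggregators.SUM_LONGS());

    pipeline.writeTextFile(histogram, "/logs/bytes-histogram");
    pipeline.done();
  }
}

How that chain is carved into physical jobs is the planner's decision rather than the developer's, which is precisely the burden Crunch sets out to lift.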

The team behind Crunch has gone a step further, creating an idiomatic Scala API equivalent in Scrunch, although it’s still an alpha build while they seek expert Scala committers.

The idea of turning concepts from the Google paper into a fully-fledged project has been knocking around for more than 12 months; Cloudera data scientist Josh Wills detailed the work behind it in October 2011. It’s not just Cloudera leading the project though, as key figures from other Hadoop vendors have joined the initiative, including Hortonworks’ Arun Murthy, who also leads the direction of the MapReduce project.

Crunch has since been donated to the Apache Software Foundation, entering incubation in May this year. Fast forward to the present day and Crunch appears to be making steady progress. The documentation does account for some known limitations, notably around how to split processing tasks between dependent MapReduce jobs.

But with plans for a seamless read/write connection to the table storage service HCatalog, we think the future’s bright for this one. Hopefully we’ll see further progress in the new year as Crunch heads tentatively towards the Incubator exit door. In all likelihood, we’ll see further JVM options too, such as JRuby and Clojure. When it does escape, the Java library could be the MapReduce pipeline unblocker you need.

Photo courtesy of Horia Varlan.
