Unblocking that MapReduce pipeline

One To Watch: Apache Crunch

Chris Mayer

Not a breakfast cereal, but yet another big data project straight out of the ASF.

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Among the projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to cover the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

Like so many similar projects before it, Crunch was inspired by work at Google, using concepts from the company's 2010 FlumeJava paper, which explores the idea of data-parallel pipelines. Crunch intends to make managing multi-stage pipelines far easier for those who need it most: the enterprises that are now picking up Hadoop.
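To give a rough feel for what a multi-stage, data-parallel pipeline looks like, here is a minimal sketch of a two-stage word count written with plain Java streams. It does not use the Crunch API and runs in a single JVM rather than on Hadoop; the class name `PipelineSketch` and the methods `tokenize` and `countWords` are hypothetical names chosen purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: plain Java streams, NOT the Crunch API.
public class PipelineSketch {

    // Stage 1 (a "map"-like step): split each line into lower-case words.
    static List<String> tokenize(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Stage 2 (a "group and combine" step): count occurrences of each word.
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "that is the question");

        // Chain the stages: the output of one feeds the next.
        Map<String, Long> counts = countWords(tokenize(lines));
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In a library built on the FlumeJava ideas, stages like these would be declared in much the same chained fashion, with the library rather than the developer deciding how to turn them into MapReduce jobs.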

The team behind Crunch have gone a step further, creating an idiomatic Scala API equivalent in Scrunch, although it’s still an alpha build as they seek expert Scala committers.

The idea of turning concepts from the Google paper into a full-blown project has been knocking around for more than 12 months: Cloudera data scientist Josh Wills detailed the work behind it in October 2011. It’s not just Cloudera leading the project though, since key figures from different Hadoop vendors have joined the initiative, including Hortonworks’ Arun Murthy, who also leads the direction of the MapReduce project.

Crunch has since been donated to the Apache Software Foundation, entering incubation in May this year. Fast forward to the present day and Crunch appears to be showing steady progress. The documentation does acknowledge some known limitations in Crunch, notably in how processing tasks are split between dependent MapReduce jobs.

But with plans for a seamless read/write connection to the table storage service HCatalog, we think the future’s bright for this one. Hopefully we’ll see further progress in the new year, as Crunch heads tentatively towards the Incubator exit door. In all likelihood, we’ll see further JVM language options too, such as JRuby and Clojure. When it does escape, the Java library could be the MapReduce pipeline unblocker you need.

Photo courtesy of Horia Varlan.
