Which distro to pick?

Hadoop: The State of Play

Chris Mayer

With partnerships and deals being struck up daily, we take stock of the Hadoop space and who has positioned themselves the best for the coming months. Some are suggesting it won’t matter though…

In the year in which Big Data has finally come to the top of the
agenda, now seems an apt moment to take stock of it all – in
particular the technology that seems to be attracting the most
attention, Hadoop.

For those of you who have somehow dodged talk surrounding the
Java-based data-crunching framework, many Hadoop-centric companies have
sprung up over the past few years, all aiming to come out on top in
this Big Data game of thrones by offering the best support
available. June, in particular, was a peak for the world of Hadoop.
Hadoop World 2012 was where most vendors outlined
their plans for the next iteration of the HDFS/MapReduce
technology, but there were also seismic shifts, with some tying up
deals with big partners. The race is well and truly on – let’s see
who has placed themselves best, with plans for Hadoop 2.0 well
under way.

The Contenders


MapR

Arguably the biggest movers in June, notching up two big
partnerships. Two weeks after Amazon gave their thumbs up for using
MapR’s high-availability M3 (open source) and M5 (enterprise)
Hadoop software distributions on AWS, up stepped Google with an
equally enticing offer. Google Compute Engine, their burgeoning
Infrastructure-as-a-Service offering, which debuted at Google I/O, will link
up with MapR to offer customers the opportunity to quickly
provision large MapR clusters with the backing of a massively
scalable cloud-based solution.

It’s quite intriguing that Google have chosen a third party for
their Hadoop needs, given that they were the ones who originally
inspired the creation of the technology by publishing white
papers on their own MapReduce and file system back in the day. But
perhaps Hadoop has passed them by, with Google opting to focus on other
services and developing fields, a shrewd move given the depth of
Google’s pockets. Google still uses MapReduce processing
extensively internally, but has chosen not to distribute it.

Although the Google offering may currently be in private beta, it certainly
sends a big message to the rest – MapR has just inked deals with the
two biggest players around. With Amazon only offering MapR within
their Elastic MapReduce service (EMR), it’s an indication that the
two big guns have opted for their architecture over Cloudera and
Hortonworks – despite having an inferior distro. MapR are
positioned well for the coming months, as their Hadoop distribution
is planned to go GA in Q3.


Hortonworks

The original innovative hub, spun out of Yahoo!, chose Hadoop
Summit as the stage to launch their own distribution, Hortonworks
Data Platform (HDP), which despite Cloudera’s head start
represents strong competition. With much of Hadoop’s early
development down to the guys at Hortonworks, they can boast
longevity as their key asset.

They also revealed that HDP will only be based on the original
Hadoop 1.0 codebase, shying away from Hadoop 2.0, unlike Cloudera
and MapR. Hortonworks will obviously play a big part in the open
source Apache Hadoop version along the way, given their input thus far.

Whilst some might see this as a fatal decision, it is in fact a
sensible one – get the most comprehensive offering based on your
origins right and then work from there. HDP is certainly an
advanced distribution, with many core components alongside HDFS
for storage and MapReduce for distributed processing.

Users can hook up to HBase for non-relational storage, Pig for
scripting, Hive for querying, ZooKeeper for coordination and
HCatalog for table and metadata management. Pairing up with VMware
will also aid them in other areas. Hortonworks is a tried and
tested solution from a provider that has been around the Big Data
block – it is clear they know what works.


Cloudera

Comparatively quieter than their rivals, Cloudera keep on
rolling with their CDH distribution. The
advantage of being first out of the blocks has helped them acquire
valuable experience and, more importantly, fuel innovation in the
sector. By supporting the Rackspace and AWS public clouds since
2009, Cloudera have made their name as the first port of call for a
Hadoop stack.

They’ve also been at the heart of developing side projects,
namely Apache Whirr, which is also in use at its competitors,
helping to run Hadoop distributions on public clouds. Just
recently, IBM announced that they were welcoming in CDH as an
option for their BigInsights customers, despite previous resistance
to specialised Hadoop vendors. This indicates the balance of power
currently lies with Cloudera. After all, they were there first, and
with the creator of Hadoop, Doug Cutting, as Architect, you couldn’t have
a better person to offer guidance.

There’s a danger that Cloudera might lose out in the long run
but its infrastructure and consistent innovation will make sure
it’s at the top table for some time to come. With them pushing
forward the newer features of Hadoop 2.0 in their releases right
away, you know that they will drive the technology to new heights.


Three companies, all with unique selling points. MapR’s
aforementioned sophisticated architecture is attaining some
leverage, Hortonworks’ steady approach means they could have a
top-dollar Hadoop 1.0 distro at their disposal, whilst Cloudera
continue to set the pace for innovation.

For the immediate future at least, it appears to be a straight
shootout between Hortonworks and Cloudera for market share, but take
a look at the horse coming up on the outside, MapR. Its deals in
June could have a lasting impact.

Whilst some argue Hadoop has had its time in the sun as the Big Data
standard, the moves made recently suggest that the yellow elephant
may just have something to say about that.
