Which distro to pick?
Hadoop: The State of Play
In the year in which Big Data finally comes to the top of the
agenda, now seems an apt moment to take stock of it all - in
particular the technology that seems to be acquiring the most
For those of you who have somehow dodged the talk surrounding the Java data-crunching framework, many Hadoop-centric companies have sprung up over the past few years, all aiming to come out on top in this Big Data game of thrones by offering the best support available. June, in particular, was a peak for the world of Hadoop. Hadoop Summit 2012 was where most vendors outlined their plans for the next iteration of the HDFS/MapReduce technology, but there were also seismic shifts, with some tying up deals with big partners. The race is well and truly on - let's see who has positioned themselves best, with plans for Hadoop 2.0 well underway.
MapR were arguably the biggest movers in June, notching up two big partnerships. Two weeks after Amazon gave the thumbs up to using MapR's M3 (free) and M5 (enterprise) Hadoop distributions on AWS, up stepped Google with an equally enticing offer. Google Compute Engine, their burgeoning Infrastructure-as-a-Service which debuted at Google I/O, will link up with MapR to let customers quickly provision large MapR clusters backed by massively scalable cloud infrastructure.
It's quite intriguing that Google have chosen a third party for their Hadoop needs, given that they were the ones who originally inspired the creation of the technology by publishing papers on their own MapReduce and file system back in the day. But perhaps Google have let Hadoop pass them by, opting to focus on other services and developing fields, a shrewd move given the depth of Google's pockets. Google still use MapReduce processing extensively internally, but chose not to distribute it.
Although the Google offering may currently be in private beta, it certainly sends a big message to the rest - MapR have just inked deals with the two biggest players around. With Amazon offering MapR within their Elastic MapReduce service (EMR), it's an indication that the two big guns have opted for MapR's architecture over those of Cloudera and Hortonworks - despite MapR having an inferior distro. MapR are well positioned for the coming months, as their Hadoop distribution is planned to go GA in Q3.
Hortonworks, the original innovation hub spun out of Yahoo!, chose Hadoop Summit as the stage to launch their own distribution, Hortonworks Data Platform (HDP), which despite Cloudera’s head start represents strong competition. With much of Hadoop's early development down to the guys at Hortonworks, they can boast longevity as their key asset.
They also revealed that HDP will only be based on the original Hadoop 1.0 codebase, shying away from Hadoop 2.0, unlike Cloudera and MapR. Hortonworks will obviously play a big part in the open source Apache Hadoop version along the way, given their input thus far.
Whilst some might see this as a fatal decision, it is in fact a sensible one - get a comprehensive offering built on your original codebase right, and then work from there. HDP is certainly an advanced distribution, with many core components alongside HDFS for storage and MapReduce for distributed processing.
Users can hook up to HBase for non-relational storage, Pig for scripting, Hive for SQL-like queries, ZooKeeper for coordination and HCatalog for metadata management. Pairing up with VMware will also aid them in other areas. Hortonworks is a tried and tested solution from a provider that has been around the Big Data block - it is clear they know what works.
Comparatively quieter than their rivals, Cloudera keep on rolling with their CDH distribution. The advantage of being first out of the blocks has helped them acquire valuable experience and, more importantly, fuel innovation in the sector. By supporting the Rackspace and AWS public clouds since 2009, Cloudera have made their name as the first port of call for a Hadoop stack.
They've also been at the heart of developing side projects, namely Apache Whirr, which is also used by their competitors to help run Hadoop distributions on public clouds. Just recently, IBM announced that they were welcoming CDH as an option for their BigInsights customers, despite previous resistance to specialised Hadoop vendors. This indicates the balance of power currently lies with Cloudera. After all, they were there first, and with the creator of Hadoop, Doug Cutting, as Architect, you couldn't ask for a better person to offer guidance.
There's a danger that Cloudera might lose out in the long run, but their infrastructure and consistent innovation will make sure they're at the top table for some time to come. With them pushing the newer features of Hadoop 2.0 into their releases right away, you know that they will drive the technology to new heights.
Three companies, all with unique selling points. MapR's aforementioned sophisticated architecture is gaining some leverage, Hortonworks' steady approach means they could have a top-dollar Hadoop 1.0 distro at their disposal, whilst Cloudera continue to set the pace for innovation.
For the immediate future at least, it appears to be a straight shootout between Hortonworks and Cloudera for market share, but take a look at the horse coming up on the outside, MapR. Its deals in June could have a lasting impact in the long run.
Whilst some say Hadoop has had its time in the sun as the Big Data standard, the moves made recently suggest that the yellow elephant may just have something to say about that.