Cuddling up to the elephant
Microsoft to open source Hadoop-helper REEF framework
With Hadoop 2.0 just around the corner, we’re beginning to see a number of YARN-related open source efforts appear. YARN is the next-generation resource manager that lets users run multiple types of job at the same time.
Making Hadoop capable of managing batch jobs and other kinds of processing jobs in the same cluster has understandably created a buzz in the community, with the potential to greatly reduce Hadoop’s ETL burden and give the big data framework a new analytical purpose.
One company that we didn’t expect to join the big data corral so willingly, however, is Microsoft, who yesterday announced their intention to open source REEF (Retainable Evaluator Execution Framework). The framework, which runs on top of YARN, aims to make it easier to implement scalable, fault-tolerant environments. It is particularly well suited to building machine-learning jobs, according to Microsoft CTO of Information Services Raghu Ramakrishnan, who was speaking at the International Conference on Knowledge Discovery and Data Mining on Monday.
Details on the framework are sketchy, with only a few conference session abstracts to go on and no technical documentation to hand. What we do know is that REEF seems to be a fairly versatile framework. Through a distributed control flow abstraction, it can support MapReduce workloads, graph processing or iterative algorithms, such as those required for machine learning.
To separate REEF from the systems built on top of it, Microsoft have created two standalone systems: a configuration manager dubbed Tang and an event-driven data movement framework called Wake. The two are language-agnostic and enable REEF to work in JVM or .NET environments.
Further information on REEF should emerge closer to its full open sourcing. The project’s flexibility, as well as the nature of the problems it is tackling (the ones which enterprises demand Hadoop sort out), make it an interesting one to watch. Equally intriguing to monitor is Microsoft’s overt embrace of Hadoop, after putting the brakes on their own big data framework Dryad in late 2011. Rather than being a Hadoop freeloader, they are finally beginning to give back with useful open source efforts such as REEF, which could help shape Hadoop’s future direction.
Image courtesy of Derek Keats