Storm-YARN brings real-time to Hadoop, now open source
Yahoo engineers integrate Twitter’s stream-processing tech with Hadoop for scalability and data sharing.
Yahoo has open-sourced Storm-YARN, which links data-crunching platform Hadoop with Twitter’s real-time alternative, Storm.
While Hadoop may be far and away the market-leading big data processing platform, it wasn’t designed to analyse constant streams of data (for example, identifying trending hashtags).
Storm, on the other hand, is optimised specifically to do “for realtime processing what Hadoop did for batch processing”.
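To make the batch-versus-stream distinction concrete, here is a minimal sketch of the trending-hashtag case in plain Python. It does not use Storm's API; the class name `TrendingHashtags`, its methods and the 60-second window are all hypothetical, and it assumes events arrive in non-decreasing timestamp order. The point is that counts are updated as each event arrives, rather than by re-scanning a stored dataset the way a batch job would.

```python
from collections import Counter, deque


class TrendingHashtags:
    """Sliding-window hashtag counter (illustrative only, not Storm's API)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, hashtag) pairs, oldest first
        self.counts = Counter()  # live counts for hashtags inside the window

    def observe(self, timestamp, hashtag):
        """Handle one event as it arrives, then expire stale events."""
        self.events.append((timestamp, hashtag))
        self.counts[hashtag] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Remove events that have fallen out of the time window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]

    def top(self, n):
        """Return up to n (hashtag, count) pairs, most frequent first."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    tracker = TrendingHashtags(window_seconds=60)
    for ts, tag in [(0, "#hadoop"), (10, "#storm"), (20, "#storm")]:
        tracker.observe(ts, tag)
    print(tracker.top(1))
```

Each call to `observe` does a small, bounded amount of work, so an answer is available at any moment. A batch system would instead wait for the data to land and then process it in bulk.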
At Yahoo, the original home of Hadoop, engineers have been working since the beginning of the year on getting the two to play nice. “Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform,” wrote Architect Andy Feng in a blog post in February.
Their effort, dubbed Storm-YARN, integrates the two, allowing Storm to tap into Hadoop’s data and spare processing power, all hosted on the same cluster (an alpha version is available now on GitHub). ‘YARN’ is the resource-management layer that takes over cluster scheduling from Hadoop’s original MapReduce component, and is the focus of Hadoop v2.0.
Even better, Storm can utilise the processing power of existing Hadoop clusters, meaning that it can rapidly scale up and down to match demand. Real-time loads tend to be more variable than batch workloads, making this elasticity useful – although Storm-YARN is said to merely “lay the groundwork” for further progress in this area.
Storm has also been enhanced with Hadoop-style security mechanisms, allowing data to be shared between the two via HBase and HDFS. This integration also reduces the physical distance of data transfers between systems, presumably an important consideration when dealing with truly massive amounts of data.
Yahoo’s approach to real-time processing is notably different from others within the Hadoop scene. Cloudera and MapR are working on new, super-fast query tools for Hadoop called Impala and Drill, both directly inspired by Google Dremel, to allow for real-time, ad-hoc queries.
Hortonworks is taking a different approach, reportedly speeding up Hadoop’s existing SQL query tool Apache Hive by 50x. The ultimate, ambitious aim of this so-called “Stinger Initiative” is to boost Hive’s speed by 100x.
With real-time processing apparently the new holy grail for the Hadoop crowd, it remains to be seen which approach will prove most efficient in the long run.
Adorable knitted bird image by katesheets.