Storm-YARN brings real-time to Hadoop, now open source
Yahoo engineers integrate Twitter’s stream processing tech with Hadoop for scalability and data-sharing.
Yahoo has open-sourced Storm-YARN, which links the data-crunching platform
Hadoop with Twitter’s real-time alternative, Storm.
While Hadoop may be far and away the market leader in
terms of big data processing platforms, it wasn’t designed for
analysis of constant streams of data (for example, identifying
Storm, on the
other hand, is optimised specifically to do “for realtime
processing what Hadoop did for batch processing”.
At Yahoo, the original home of Hadoop, engineers have
been working since the beginning of the year on getting the two to
play nice. “Enabling low-latency big-data processing is one of the
primary design goals of Yahoo!’s next-generation big-data
platform,” wrote Architect Andy Feng in a
blog post in February.
Their effort, dubbed Storm-YARN, integrates the two,
allowing Storm to tap into Hadoop’s data and spare processing
power, all hosted on the same cluster (an alpha version is available now on GitHub).
‘YARN’ is the next version of Hadoop’s resource-managing MapReduce
component, and the focus of Hadoop v2.0.
Even better, Storm can utilise the processing power of
existing Hadoop clusters, meaning that it can rapidly scale up and
down to match demand. Real-time loads tend to be more variable than
batch-processed data, making this elasticity useful – although
Storm-YARN is said to merely “lay the groundwork” for further
progress in this area.
Storm has also been enhanced with Hadoop-style
security mechanisms, allowing data to be shared between the two via
HBase and HDFS. This integration also reduces the physical distance
of data transfers between systems, presumably an important
consideration when dealing with truly massive amounts of data.
Yahoo’s approach to real-time processing is notably
different from others within the Hadoop scene. Cloudera and MapR
are working on new, super-fast query tools for Hadoop called
Impala and Drill respectively, both directly inspired by Google Dremel, to allow for
real-time, ad-hoc queries.
Hortonworks is taking a different approach,
reportedly speeding up Hadoop’s existing SQL query tool Apache
Hive by 50x. The ultimate, ambitious aim of this so-called “Stinger
Initiative” is to boost Hive’s speed by 100x.
With real-time stream processing apparently the new holy grail
for the Hadoop crowd, it remains to be seen which approach is most
efficient in the long run.