Pushed downstream by LinkedIn

One To Watch: Apache Samza, the continuous computation stream processing system

Chris Mayer

There’s a new project in the Apache Incubator, as LinkedIn pushes their stream processing framework to the open source foundation.

As we hurtle towards Apache Hadoop 2.0, we’ve seen a number of
new projects spring up, ushering in the next generation of the big
data processing framework. The stability of the core platform has
been nailed down and now we’re beginning the next phase of
innovation, such as speeding up query time in Impala, real-time
analytics with Apache Drill and complete lifecycle management in
Apache Ambari.

It’s no bad thing to see such a flurry of
activity, but keeping track of the latest data developments isn’t
easy. There’s now another name to remember with a new
project, Samza, appearing recently in the Apache Incubator.

The concept is simple enough. The stream processing system is designed
to run continuous computation on infinite streams of data from
publish-subscribe systems. The user writes a stream processing task
and executes it as a Samza job. Samza then acts as the operator,
routing messages between the tasks and the systems they are
addressed to.
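To make the idea concrete, here is a toy sketch of that division of labour: the user supplies a task, and a router delivers messages between streams and tasks. This is not Samza's API. The `ToyRouter` class, the `count_words` task and the stream names are all hypothetical, and the queues are in-memory stand-ins for a publish-subscribe system such as Kafka.

```python
from collections import defaultdict, deque


class ToyRouter:
    """Hypothetical stand-in for the routing role described above (not Samza's API)."""

    def __init__(self):
        self.queues = defaultdict(deque)  # stream name -> pending messages
        self.tasks = defaultdict(list)    # stream name -> subscribed task functions

    def subscribe(self, stream, task):
        """Register a task to receive every message published to a stream."""
        self.tasks[stream].append(task)

    def send(self, stream, message):
        """Publish a message to a named stream."""
        self.queues[stream].append(message)

    def run(self):
        """Deliver messages until all queues drain.

        Messages on streams with no subscribed task are collected and returned.
        """
        outputs = defaultdict(list)
        progressed = True
        while progressed:
            progressed = False
            # Snapshot the stream names, since tasks may create new streams.
            for stream in list(self.queues):
                queue = self.queues[stream]
                while queue:
                    message = queue.popleft()
                    progressed = True
                    if self.tasks[stream]:
                        for task in self.tasks[stream]:
                            task(message, self.send)
                    else:
                        outputs[stream].append(message)
        return dict(outputs)


def count_words(message, send):
    """Hypothetical user-written task: emit a word count for each message."""
    send("word-counts", {"text": message, "words": len(message.split())})


router = ToyRouter()
router.subscribe("page-views", count_words)
router.send("page-views", "hello stream processing")
result = router.run()
```

A real stream never ends, so a production system keeps running rather than draining its queues and returning. The sketch stops only so that the flow from input stream to task to output stream can be inspected in `result`.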

Created within LinkedIn to “enable easier
processing of streaming data on top of Apache Kafka”, the framework
also uses Apache YARN, the resource manager set for Hadoop 2.0, to
deploy tasks in a distributed cluster. The proposal also states that
YARN is used to “make decisions about stream processor locality,
co-partition of streams and provide security.”

It’s hardly a surprise to see the social
network’s development team plump for Kafka, the distributed
publish-subscribe system they created, nor is it to see them use
Scala, being big fans of the functional JVM language.

The LinkedIn team have chosen to bring the
project to Apache in the hope that it “will be useful to many
organizations facing a similar need” to theirs, but also say that
because of Samza’s design, new use cases will emerge.

“The ASF is the natural choice to host the Samza
project as its goal of encouraging community-driven open-source
projects fits with our vision for Samza,” the proposal reads. With
many other data projects residing at Apache, the team expect Samza
to integrate with ZooKeeper, YARN, HDFS and log4j.

Despite being created just 18 months ago,
Samza’s committers say the framework has seen fast adoption. At the
same time they recognise the real challenge is to foster a diverse
enough community behind it, with LinkedIn currently being the only
internal user. Fortunately, the four initial committers all hold
varying degrees of open source experience.

With LinkedIn’s engineering pedigree held in great regard,
the project should fly once it gets out of the Apache Incubator.
