One To Watch: Apache Samza, the continuous computation stream processing system
There's a new project in the Apache Incubator, as LinkedIn pushes its stream processing framework to the open source foundation.
As we hurtle towards Apache Hadoop 2.0, we’ve seen a number of new projects spring up, ushering in the next generation of the big data processing framework. The stability of the core platform has been nailed down and now we’re beginning the next phase of innovation, such as faster query times in Impala, real-time analytics with Apache Drill and complete lifecycle management in Apache Ambari.
It’s no bad thing to see such a flurry of activity, but keeping track of the latest data developments isn’t easy. There’s now another name to remember, with a new proposal, Samza, appearing recently in the Apache Incubator.
Samza’s concept is simple enough. The stream processing system is designed to run continuous computation on infinite streams of data from publish-subscribe systems. The user writes a stream processing task and executes it as a Samza job. Samza then acts as the operator, routing messages between the tasks and the systems they are addressed to.
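To make the division of labour concrete, here is a minimal sketch of the idea in Python. It is not Samza's API: the names `Collector`, `UppercaseTask` and `run_job`, and the stream names, are all hypothetical, and an in-memory queue stands in for a publish-subscribe system. The user supplies only the task; the runner plays the operator, handing each incoming message to the task and routing whatever the task emits to the stream it is addressed to.

```python
from collections import defaultdict, deque


class Collector:
    """Gathers the messages a task emits, paired with their destination stream."""

    def __init__(self):
        self.outbox = []

    def send(self, stream, message):
        self.outbox.append((stream, message))


class UppercaseTask:
    """Hypothetical user-written task: emits each text message upper-cased."""

    def process(self, message, collector):
        collector.send("shouted", message.upper())


def run_job(task, input_stream, streams):
    """Toy runner (an assumption, not Samza): feed messages in, route output."""
    collector = Collector()
    queue = streams[input_stream]
    while queue:
        task.process(queue.popleft(), collector)
        # Deliver everything the task emitted to its destination stream.
        for stream, message in collector.outbox:
            streams[stream].append(message)
        collector.outbox.clear()


# In-memory stand-in for the publish-subscribe system.
streams = defaultdict(deque)
streams["greetings"].extend(["hello", "samza"])
run_job(UppercaseTask(), "greetings", streams)
print(list(streams["shouted"]))  # ['HELLO', 'SAMZA']
```

A real stream never ends, so an actual job would keep consuming rather than stop once a finite queue is drained; the sketch only shows who is responsible for what.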
Created within LinkedIn to “enable easier processing of streaming data on top of Apache Kafka”, the framework also uses Apache YARN, the resource manager set for Hadoop 2.0, to deploy tasks in a distributed cluster. The proposal also states YARN is used to “make decisions about stream processor locality, co-partition of streams and provide security.”
It’s hardly a surprise to see the social network’s development team plump for Kafka, the distributed publish-subscribe system they created, nor is it a surprise to see them use Scala, being big fans of the functional JVM language.
The LinkedIn team have chosen to bring the project to Apache in hope that it “will be useful to many organizations facing a similar need” to theirs, but also say that because of Samza’s design, new use cases will emerge.
“The ASF is the natural choice to host the Samza project as its goal of encouraging community-driven open-source projects fits with our vision for Samza,” the proposal reads. With many other data projects residing at Apache, the team expects Samza to integrate with ZooKeeper, YARN, HDFS and log4j in time.
Despite being created just 18 months ago, Samza’s committers say the framework has seen fast adoption. At the same time they recognise the real challenge is to foster a diverse enough community behind it, with LinkedIn currently being the only internal user. Fortunately, the four initial committers all hold varying degrees of open source experience.
With LinkedIn’s engineering pedigree held in great regard, the project should fly once it gets out of the Apache Incubator.
Stream image courtesy of audreyjm529