One to Watch
Graph processing system Apache Giraph hits 1.0 - will it follow in Hadoop’s footsteps?
A year on from becoming an Apache Top-Level Project, iterative graph processing system Apache Giraph has had its first major release under the Apache Software Foundation banner.
Initially donated by Yahoo, Giraph is inspired by the BSP (bulk synchronous parallel) model and Google’s Pregel paper, which describes a model for scalable and fault-tolerant graph computing. It’s not the first time a Google paper has laid the foundations for a data software project - after all, Apache Hadoop is an open source implementation of the model described in Google’s 2004 MapReduce paper.
While graph databases such as Neo4j focus primarily on storing and querying graphs, Giraph deals with heavy-duty processing. It runs on Hadoop infrastructure, giving it a design with no single point of failure. However, MapReduce’s preference for keys and values isn’t the most efficient way of processing graphs. Most graph algorithms also require repeated iteration, something MapReduce can only do through multiple chained jobs. Giraph instead opts for a more natural BSP-style approach: the computation proceeds in a series of iterations, and in each one a vertex can receive the messages sent to it and send messages to other vertices in the graph.
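To make the BSP idea concrete, here is a minimal sketch in plain Python of the classic "propagate the maximum value" example often used to explain Pregel-style computation. It is an illustration only and does not use Giraph's actual (Java) API: the function name `bsp_max_value` and the dictionary-based graph representation are assumptions made for this example. Each pass through the loop stands for one superstep, and swapping the outbox into the inbox plays the role of the synchronisation barrier.

```python
from collections import defaultdict


def bsp_max_value(graph, values):
    """Spread the largest vertex value through a graph in BSP supersteps.

    graph:  dict mapping a vertex id to a list of neighbouring vertex ids
    values: dict mapping a vertex id to its initial numeric value
    Returns (final_values, supersteps_run).

    Hypothetical helper for illustration; this is not Giraph's API.
    """
    values = dict(values)  # work on a copy, leave the caller's dict alone

    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The value changed, so tell the neighbours about it.
                values[vertex] = best
                for neighbour in graph.get(vertex, []):
                    outbox[neighbour].append(best)
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1

    return values, supersteps
```

Called on a small chain such as `{"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}` with differing starting values, every vertex ends up holding the largest value once no messages remain in flight. Doing the same with MapReduce would need one chained job per round, which is the overhead the BSP approach avoids.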
The Giraph team claims that the system can “easily” be scaled out to hundreds of machines and, memory permitting, hundreds of billions of edges. The project wisely keeps close ties to the Hadoop ecosystem that underpins it - ZooKeeper’s coordination service is used to achieve fault tolerance; data can be read from and written to Hive tables; and there’s also support for YARN, the new resource management framework introduced for Apache Hadoop 2.0.
Although Giraph left the Apache Incubator last May, only now has its codebase been declared stable enough for production use.
Its arrival has surely come at the perfect time. Interest in graph processing has never been greater. Twitter open sourced Scala graph processing library Cassovary last March, and eyes are still firmly fixed on Facebook’s Graph API.
As the user bases of the social networking giants continue to grow, so does the number of companies wanting to follow suit. As a result, there’s a greater need for faster ways of deriving insights from users’ data, such as shared connections and popularity measures tailored to each user.
Pioneers such as Facebook, LinkedIn and Twitter are already using Giraph, as well as contributing to its codebase, which is surely an indication of its readiness for production environments. Like Apache Hadoop before it, Giraph’s 1.0 release could prompt a quick rise into more enterprise environments, as well as attracting more contributors to the community.