One to Watch

Graph processing Apache Giraph hits 1.0 – will it follow in Hadoop’s footsteps?

Chris Mayer

Already used by social networking giants Facebook, Twitter and LinkedIn, is Apache Giraph set to lead the charge for scalable social graphs?

A year on from becoming an Apache Top-Level Project, the iterative
graph processing system Apache Giraph has had its first major release
under the Apache Software Foundation banner.

Initially donated by Yahoo, Giraph is inspired
by the BSP (bulk synchronous parallel) model and
Google’s Pregel paper, which demonstrates a
model for scalable and fault-tolerant graph computing. It’s not the
first time a Google paper has laid the foundations for a data
software project; after all, Apache Hadoop is an open source
implementation of Google’s MapReduce paper.

While Neo4j focuses solely on storage, Giraph
deals with heavy-duty processing. It runs on Hadoop infrastructure,
meaning its design has no single point of
failure. However, MapReduce’s preference for keys and values
isn’t the most efficient way of processing graphs. Most graph
algorithms also require repeated iteration, something which MapReduce can only
do through multiple chained jobs. Giraph instead opts for a more
natural BSP-style approach: vertices can send messages to, and receive
messages from, other vertices in the graph as the computation iterates.
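
To make the BSP idea concrete, here is a minimal sketch in plain Python of a vertex-centric computation that spreads the largest vertex value across a graph. It is not Giraph's API: the function name `propagate_max`, the dictionary-based graph and the sample data are all assumptions made for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def propagate_max(edges, values, max_supersteps=50):
    """Spread the largest vertex value through a graph in BSP-style supersteps.

    Hypothetical helper for illustration only, not part of Giraph.
    edges:  dict mapping each vertex to a list of its neighbours
    values: dict mapping each vertex to its starting value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)  # work on a copy, leave the caller's data alone

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for neighbour in edges.get(vertex, []):
            inbox[neighbour].append(value)
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The value improved, so tell the neighbours about it.
                values[vertex] = best
                for neighbour in edges.get(vertex, []):
                    outbox[neighbour].append(best)
            # Otherwise the vertex stays silent, which acts as a vote to halt.
        # Barrier: messages sent in this superstep are only read in the next.
        inbox = outbox
        supersteps += 1

    return values, supersteps


# A four-vertex chain a - b - c - d (assumed sample data).
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}
result, steps = propagate_max(edges, values)
print(result, steps)
```

The whole computation lives in one loop with a barrier between rounds, and it stops as soon as no vertex has anything left to say. Expressing the same thing in MapReduce would mean launching a separate job for every round.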

The Giraph team claim that the system can be
scaled out to hundreds of machines “easily” and, memory permitting,
to hundreds of billions of edges. The project wisely keeps its ties
close to the Hadoop ecosystem that underpins it: ZooKeeper’s
coordination service is used to achieve fault tolerance; there’s
access to and from Hive tables; and there’s also support for YARN,
the newest processing framework touted for Apache Hadoop.

Although Giraph left the Apache Incubator last May,
only now is its codebase considered
stable enough for production use.

Its arrival has surely come at the perfect time.
Interest in graph processing has never been greater.
Twitter open sourced its Scala graph processing library
last March, and eyes are still
firmly fixed on Facebook’s Graph API.

As the user base of social networking giants
continues to grow, so does the number of companies wanting to
follow suit. As such, there’s a greater need for faster methods of
gleaning personal information, such as shared connections and
personalisation-based popularity.

Pioneers such as Facebook, LinkedIn and Twitter are
already using Giraph, as well as contributing to its codebase,
which is surely an indication of its readiness for production
environments. Like Apache Hadoop before it, Giraph’s 1.0 release
could prompt a quick rise into more enterprise environments, as
well as encouraging more developers to join the community.
