Laying the groundwork

Hadoop column storage format Parquet now generally available

Chris Mayer

Thanks to Twitter and Cloudera’s efforts, Hadoop has a new storage format ready-made for analytic processing.

Little more than three months since the covers were lifted off the project, Parquet has hit its 1.0 milestone.

The columnar storage format library for Apache Hadoop is the result of a partnership between Twitter and Cloudera, with the intention of making Parquet a critical cog in next-generation Hadoop architecture.

The concept is fairly simple, according to Twitter analytics infrastructure engineering manager Dmitriy Ryaboy. Rather than storing records in rows, Parquet stacks the values of each column on top of one another. As column values share the same type, generic compression is far easier; and because values are stored consecutively, query engines can skip loading unneeded columns.
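To see the difference in miniature, consider the library-free Java sketch below. It illustrates the principle only, not Parquet's actual on-disk format, which layers row groups, pages and encodings on top of the basic idea.

```java
// Toy in-memory contrast of row-oriented vs column-oriented layout.
// Illustrative only: Parquet's real format is far more sophisticated.
import java.util.Arrays;
import java.util.List;

public class RowVsColumn {
    public static void main(String[] args) {
        // Row layout: each record keeps all its fields together, so
        // summing one field still means touching every whole record.
        List<Object[]> rows = Arrays.asList(
            new Object[]{1L, "GB", 3.14},
            new Object[]{2L, "GB", 2.71},
            new Object[]{3L, "US", 1.41});
        double rowSum = 0;
        for (Object[] r : rows) rowSum += (Double) r[2];

        // Column layout: values of one column sit consecutively. The
        // "country" column is same-typed and repetitive, which is ideal
        // for dictionary or run-length encoding.
        long[]   ids     = {1L, 2L, 3L};
        String[] country = {"GB", "GB", "US"};
        double[] measure = {3.14, 2.71, 1.41};

        // A query needing only "measure" reads just that array and
        // skips the others entirely: the I/O saving Parquet exploits.
        double colSum = 0;
        for (double m : measure) colSum += m;
        System.out.println(rowSum + " == " + colSum);
    }
}
```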

While the format is an ideal fit for those interested in analytical processing, Ryaboy warns that “implementing a columnar storage format” in a Hadoop-based processing engine is “tricky.”

“Not all data people store in Hadoop is a simple table — complex nested structures abound,” he explains. “For example, one of Twitter’s common internal datasets has a schema nested seven levels deep, with over 80 leaf nodes.”
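To give a flavour of what such nesting looks like, here is a sketch using the schema parser shipped in the 1.0-era parquet-mr jars; the parquet.schema package names are from that release and may differ in later ones, and the schema itself is made up for illustration rather than being Twitter's internal one.

```java
// Hedged sketch: parsing a nested Parquet schema with the 1.0-era
// MessageTypeParser. The Tweet schema below is hypothetical.
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class NestedSchemaExample {
    public static void main(String[] args) {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Tweet {"
          + "  required int64 id;"
          + "  optional group user {"
          + "    required binary name;"
          + "    optional group location {"
          + "      optional double lat;"
          + "      optional double lon;"
          + "    }"
          + "  }"
          + "  repeated group hashtags {"
          + "    required binary text;"
          + "  }"
          + "}");
        // Parquet flattens such trees into one column chunk per leaf,
        // using Dremel-style repetition and definition levels to
        // reconstruct the nesting on read.
        System.out.println(schema);
    }
}
```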

March’s unveiling was well received by the wider Hadoop community, with the contributor list expanding to 18. Thanks to newcomers from the likes of Criteo and UC Berkeley, the 1.0 release packs plenty of improvements, such as a Java implementation of dictionary encoding and hybrid bit-packing.
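Dictionary encoding is easy to picture: repeated values are stored once in a dictionary, and the column itself becomes a stream of small integer ids. The toy Java below shows the idea only, not Parquet's implementation, where those ids would then be bit-packed down to a handful of bits each.

```java
// Toy illustration of dictionary encoding, not Parquet's implementation.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncodingSketch {
    public static void main(String[] args) {
        String[] column = {"GB", "GB", "US", "GB", "FR", "US"};

        Map<String, Integer> dict = new HashMap<String, Integer>();
        List<String> dictionaryPage = new ArrayList<String>();
        int[] ids = new int[column.length];

        for (int i = 0; i < column.length; i++) {
            Integer id = dict.get(column[i]);
            if (id == null) {              // first sighting of this value
                id = dictionaryPage.size();
                dict.put(column[i], id);
                dictionaryPage.add(column[i]);
            }
            ids[i] = id;
        }

        // 3 distinct values, so each id fits in 2 bits instead of a
        // full string: the bit-packing half of the hybrid encoding.
        System.out.println("dictionary = " + dictionaryPage);
        System.out.println("ids = " + java.util.Arrays.toString(ids));
    }
}
```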

Like many big data projects, Parquet takes a leaf out of Google’s book, adopting the approach put forward in Dremel, Google’s paper on interactive analysis. Parquet also doesn’t tie itself down to any other existing Hadoop project.

Parquet ships MapReduce input and output formats and supports both the Hadoop 1.0 and 2.0 series APIs. Integration has been achieved with a number of tools in the Hadoop ecosystem, including Hive, Pig, Cascading and the new breed of SQL engines, Impala and Drill (although the latter integration is still a work in progress), so you can expect support for all of the above.
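As a rough sketch of what the MapReduce hook-up looks like, the snippet below wires Parquet's example input and output formats into a new-API (Hadoop 2.x) job. The parquet.hadoop.example class names are taken from the 1.0-era artifacts and are worth verifying against whatever jars you pull down; the mapper and reducer that would process Group records are omitted for brevity.

```java
// Hedged sketch: wiring Parquet into a new-API MapReduce job using the
// example helpers bundled with 1.0-era parquet-mr. Verify the
// parquet.hadoop.example.* names against your dependency versions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import parquet.hadoop.example.ExampleInputFormat;   // reads records as Group
import parquet.hadoop.example.ExampleOutputFormat;  // writes Group records
import parquet.hadoop.example.GroupWriteSupport;
import parquet.schema.MessageTypeParser;

public class ParquetJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parquet-sketch");
        job.setJarByClass(ParquetJobSketch.class);

        // Parquet supplies the InputFormat/OutputFormat; the rest of
        // the job (mapper, reducer, key/value types) is set up as usual.
        job.setInputFormatClass(ExampleInputFormat.class);
        job.setOutputFormatClass(ExampleOutputFormat.class);

        // ExampleOutputFormat needs the write schema on the
        // Configuration (toy schema here, for illustration only).
        GroupWriteSupport.setSchema(
            MessageTypeParser.parseMessageType(
                "message out { required int64 id; }"),
            job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```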

Ryaboy says that the next major goal for Parquet is to foster large community backing. If the move to the Apache Incubator is still on the cards, the project should be in the best hands to achieve this. Demand for columnar storage in data warehousing is growing too, with products like Amazon Redshift arriving earlier in the year, and it’s high time Hadoop had a project offering the same functionality.

With a number of souped-up query engines appearing in recent months, such as Impala, Drill and the Stinger Initiative for Hive, Parquet could well be a useful, flexible companion in the next wave of Hadoop applications.

Parquet 1.0 is available to download on GitHub and via Maven Central. It’s worth checking out this session on Parquet from Hadoop Summit if your interest is piqued.

