Hadoop column storage format Parquet now generally available
Thanks to Twitter's and Cloudera's efforts, Hadoop has a new storage format ready-made for analytic processing.
Little more than three months since the covers were lifted off the project, Parquet has hit its 1.0 milestone.
The Apache Hadoop columnar storage format library is the result of a partnership between Twitter and Cloudera, with the intention of making Parquet a critical cog in next-generation Hadoop architecture.
The concept is fairly simple, according to Twitter analytics infrastructure engineering manager Dmitriy Ryaboy. Rather than storing records row by row, Parquet lays data out column by column. Because the values in a column share the same type, generic compression is far more effective; and because those values are stored consecutively, query engines can skip loading columns a query doesn't need.
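The idea can be sketched in a few lines. This is a toy illustration of the row-versus-column trade-off, not Parquet's actual on-disk format, and the field names are invented:

```python
# Toy illustration of columnar layout (not Parquet's real format).
rows = [
    {"user": "alice", "clicks": 10, "country": "US"},
    {"user": "bob",   "clicks": 3,  "country": "DE"},
    {"user": "carol", "clicks": 7,  "country": "US"},
]

# Row layout: one complete record after another.
row_store = rows

# Column layout: each field's values stored consecutively.
col_store = {key: [r[key] for r in rows] for key in rows[0]}

# A query like SUM(clicks) now reads one contiguous array
# instead of scanning every full record.
total_clicks = sum(col_store["clicks"])
print(total_clicks)  # → 20

# Adjacent same-typed values also compress well; note "US" repeats.
print(col_store["country"])  # → ['US', 'DE', 'US']
```

The same principle scales up: on real datasets, skipping unneeded columns avoids most of the I/O for typical analytic queries.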
While the format is an ideal fit for analytical processing, Ryaboy warns that “implementing a columnar storage format” in a Hadoop-based processing engine is “tricky.”
“Not all data people store in Hadoop is a simple table — complex nested structures abound,” he explains. “For example, one of Twitter’s common internal datasets has a schema nested seven levels deep, with over 80 leaf nodes.”
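To see why nesting complicates a columnar layout: a column store needs one column per leaf field of the schema, however deep the nesting goes. The sketch below walks a nested record and collects its leaf fields under dotted paths; the record and its field names are hypothetical, and real Parquet (following Dremel) additionally tracks repetition and definition levels to reassemble repeated and optional fields:

```python
# Hypothetical nested record; field names are invented for illustration.
record = {
    "tweet": {
        "id": 123,
        "user": {"name": "dmitriy", "location": {"city": "SF"}},
        "entities": {"hashtag": "hadoop"},
    }
}

def leaf_paths(node, prefix=""):
    """Recursively collect a dotted path for every leaf value."""
    paths = {}
    for key, value in node.items():
        full = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.update(leaf_paths(value, full))
        else:
            paths[full] = value
    return paths

print(leaf_paths(record))
# → {'tweet.id': 123, 'tweet.user.name': 'dmitriy',
#    'tweet.user.location.city': 'SF', 'tweet.entities.hashtag': 'hadoop'}
```

Each dotted path becomes its own column; a schema seven levels deep with 80-plus leaves, as in Twitter's dataset, yields 80-plus such columns.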
March’s unveiling was well received by the wider Hadoop community, and the project has since grown to 18 contributors. Thanks to newcomers from the likes of Criteo and UC Berkeley, the 1.0 release packs plenty of improvements, such as Java-implemented dictionary encoding and hybrid bit-packing.
Like many big data projects, the team has taken a leaf out of Google’s book, adopting the approach put forward in Dremel, Google’s paper on interactive analysis of nested data. Parquet also avoids tying itself to any other existing Hadoop project.
Parquet provides MapReduce Input and Output formats and supports both the Hadoop 1.0 and 2.0 series APIs. It has been integrated with a number of tools in the Hadoop ecosystem, including Hive, Pig, Cascading and the new breed of SQL engines, Impala and Drill (although the latter is a work in progress), so you can expect support across all of the above.
Ryaboy says that the next major goal for Parquet is to foster a large community backing. If the move to the Apache Incubator is still on the cards, the project should be in the best hands to achieve this. Demand for columnar storage in data warehousing is growing too, with products like Amazon Redshift arriving earlier this year, and it’s high time Hadoop had a project offering the same capability.
With a number of souped-up query engines appearing in recent months, such as Impala, Drill and the Stinger Initiative for Hive, Parquet could well be a useful, flexible companion in the next wave of Hadoop applications.
Parquet image courtesy of couscouschocolat