Cloudera and Twitter bring Parquet to the Hadoop floor
The two companies unveil details of a new column format for Hadoop, intended to help cement Impala’s real-time credentials.
Cloudera have joined forces with Twitter to open source a new columnar storage format for Hadoop, called Parquet.
The duo say the GitHub-hosted project was designed to provide “compressed, efficient columnar data representation” for any project in the Hadoop ecosystem, meaning that Parquet is language and data model agnostic.
Like other recently emerging projects such as Impala and Apache Drill, Parquet draws heavily on Google’s 2010 research paper on Dremel, which pioneered the repetition/definition level approach to encoding nested data structures. Parquet separates the concepts of encoding and compression, allowing developers to specify compression schemes per column and to implement operators that work directly on encoded data, without paying a decompression penalty first.
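To give a feel for the repetition/definition level idea, here is a minimal sketch for the simplest possible case: a single repeated field per record. The function names `shred` and `assemble` are hypothetical and chosen for illustration only; this is not Parquet's API, and real schemas with several levels of nesting need larger level values than the 0 and 1 used here.

```python
def shred(records):
    """Flatten a repeated field into (value, repetition, definition) triples.

    Assumption for this sketch: each record is just a list of values.
    repetition 0 = this entry starts a new record, 1 = it continues the list.
    definition 0 = the list is empty (no value stored), 1 = a value is present.
    """
    column = []
    for values in records:
        if not values:
            # An empty list still needs one entry so the record is not lost.
            column.append((None, 0, 0))
            continue
        for i, value in enumerate(values):
            column.append((value, 0 if i == 0 else 1, 1))
    return column


def assemble(column):
    """Rebuild the original records from the flat column of triples."""
    records = []
    for value, repetition, definition in column:
        if repetition == 0:
            records.append([])  # a repetition level of 0 opens a new record
        if definition == 1:
            records[-1].append(value)
    return records


records = [["a", "b"], [], ["c"]]
column = shred(records)
restored = assemble(column)
```

The flat `column` can be stored contiguously, as a columnar format would store it, while the two small integers per entry carry enough information to restore the record boundaries and the empty list.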
Hadoop’s dominant column database is HBase, but the team behind Parquet insist that the aim is not to oust an old favourite but to offer further choice in a vibrant ecosystem.
“Parquet is built to be used by anyone,” assures Dmitriy Ryaboy, part of Twitter’s Analytics Infrastructure team.
“The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.”
Cloudera will include a preview of Parquet in their real-time query engine Impala. Ryaboy reveals that with the two in unison, Impala boasts up to a 10x performance improvement when compared to competitors, and believes there’s “room for improvement” too.
While the project is clearly open to contributions from other vendors, Cloudera will be hoping it floors the opposition.
The Stinger Initiative, unveiled last month by fellow Hadoop vendor Hortonworks, set out plans to renovate Hive, a different route from the one Cloudera are taking. It’ll be interesting to see over the coming months whether renewing the old or starting afresh pays dividends.
Twitter, the other partner, has already begun to use the tool in production, converting some of its major data sources to Parquet’s format. Parquet is still under heavy development, with several features planned for the near future. These include support for the data warehouse system Hive and Hadoop’s abstraction layer Cascading (provided by Criteo), as well as improvements to the Pig support already in place.
The team are welcoming feedback on Parquet and plan to submit it to the Apache Incubator further down the line, in the hope of fostering a community behind it.
Image courtesy of Môsieur J. [version 8.0]