Welcome to the ecosystem

Cloudera and Twitter bring Parquet to the Hadoop floor

Chris Mayer

The two companies unveil details behind a new columnar storage format for Hadoop, designed in part to cement Impala's real-time credentials.

Cloudera have joined forces with Twitter to open source a
new columnar storage format for Hadoop, called
Parquet.

The duo say the GitHub-hosted project was designed to provide "compressed, efficient columnar data representation" to any project in the Hadoop ecosystem, meaning that Parquet is language- and data-model-agnostic.
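What "columnar data representation" buys you can be sketched in a few lines of plain Python. This is a toy illustration of the general idea, not Parquet's actual file layout or API:

```python
# The same three records, stored row-wise and then column-wise (toy sketch).
rows = [("alice", 30), ("bob", 25), ("carol", 35)]

# Column-oriented: each field's values sit contiguously, so a query that
# only needs ages scans one array, and similar values stored together
# tend to compress far better than interleaved rows.
names = [r[0] for r in rows]
ages = [r[1] for r in rows]

avg_age = sum(ages) / len(ages)  # touches only the "ages" column
```

A row store would have to read every record to answer the same query; the columnar layout reads only the one field it needs.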

Like other recently emerged projects such as Impala and Apache Drill, Parquet draws heavily on Google's 2010 Dremel research paper, which pioneered the repetition/definition-level approach to encoding nested data structures. Parquet also separates the concepts of encoding and compression, allowing developers to specify compression schemes per column and to implement operators that work directly on encoded data, without decompressing it first.
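Dremel's record-shredding trick can be sketched for the simplest case: a single nullable, repeated field. The function below is an illustrative toy, not Parquet's actual implementation; the level assignments assume one level of nesting, where the maximum definition level is 2:

```python
def encode_repeated(records):
    """Toy Dremel-style shredding of one nullable, repeated column.

    Each value becomes a (value, repetition_level, definition_level) triple:
      - repetition level 0 starts a new record, 1 continues the current list;
      - definition level 0 = field missing, 1 = empty list, 2 = value present.
    """
    out = []
    for rec in records:
        if rec is None:
            out.append((None, 0, 0))      # field absent entirely
        elif not rec:
            out.append((None, 0, 1))      # list present but empty
        else:
            for i, value in enumerate(rec):
                rep = 0 if i == 0 else 1  # repeat only within a record
                out.append((value, rep, 2))
    return out

# Four nested records flatten into one value stream plus two small streams
# of levels, from which the original nesting is losslessly reconstructible.
flat = encode_repeated([[1, 2], None, [], [3]])
# → [(1, 0, 2), (2, 1, 2), (None, 0, 0), (None, 0, 1), (3, 0, 2)]
```

Because the level streams are tiny integers, they compress extremely well, which is part of why the columnar approach pays off even for nested data.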

Hadoop’s dominant column-oriented database is HBase, but the team behind Parquet insist that the project isn’t about ousting an old favourite, but about offering further choice in a vibrant ecosystem.

“Parquet is built to be used by anyone,” assures Dmitriy Ryaboy, part of Twitter’s Analytics Infrastructure team.

“The Hadoop ecosystem is rich with data processing
frameworks, and we are not interested in playing favorites. We
believe that an efficient, well-implemented columnar storage
substrate should be useful to all frameworks without the cost of
extensive and difficult to set up dependencies.”

Cloudera will include a preview of Parquet in their real-time query engine Impala. Ryaboy reveals that with the two in unison, Impala boasts up to a 10x performance improvement over competitors, and believes there’s “room for improvement” too.

While the project is clearly open to other vendors to
contribute to, Cloudera will be hoping this latest project will
floor the opposition.


The Stinger Initiative
unveiled by fellow
Hadoop vendor, Hortonworks, last month explained how they were
planning to renovate HBase, going down a different route to
Cloudera. It’ll be interesting to see over the coming months
whether renewing the old or starting afresh will pay
dividends.

The other partner, Twitter, has already begun to use the tool in production, converting some of its major data sources to Parquet’s format. The project is still under heavy development, with several features planned for the near future, including support for the data warehouse system Hive and the Hadoop abstraction layer Cascading (provided by Criteo), as well as improvements to the Pig support already in place.

The team are welcoming feedback on Parquet and plan to submit it to the Apache Incubator further down the line, in the hope of fostering a community around it.

Image courtesy of Môsieur J. [version 8.0]
