Eyes on HAWQ

Pivotal ship Hadoop distro complete with world’s fastest SQL query engine

Chris Mayer

After rigorous testing, the company spun out of VMware and EMC technologies have made their Hadoop stack available for download. But what makes it stand out?

Pivotal, the spinoff of VMware and EMC technologies, has shipped its first product – a Hadoop distribution claiming to be 100% Apache-based.

Back in February, the covers came off Pivotal HD (standing for Hadoop distribution) to little fanfare, just after the creation of the ‘next generation’ company formed of big data and cloud technologies. After rigorous testing at the 1,000-node-strong Pivotal Analytics Workbench, we now know more about what is behind the stack.

Pivotal HD 1.0 is built on the Hadoop 2.0.2 codebase, meaning it has both the old MapReduce processing method and the highly anticipated YARN framework, touted for the next major Hadoop release.


YARN (Yet Another Resource Negotiator) is the community’s attempt to overcome the shortcomings of the batch processing method in Hadoop, transforming it into the near real-time data platform that clients want. The new distributed processing framework can handle and schedule resource requests from multiple applications while also overseeing their execution.

Including both the tried and tested mechanism that helped Hadoop become an enterprise favourite and the foundation of Hadoop 2.0 should help generate interest. Whether it was wise to jump ahead of competitors, though, is up for debate. Hortonworks (leading YARN development) don’t include it in their latest stable distribution, but do in the community preview of Hortonworks Data Platform 2.0.


The Community Edition of the Pivotal HD stack is composed entirely of Apache Hadoop stablemates. Aside from the standard processing and storage components, MapReduce and HDFS, the distribution also contains fellow big data projects such as data warehouse Hive, the machine learning-focused Mahout, query language Pig and NoSQL datastore HBase. Completing the set are distributed configuration service ZooKeeper, speedy bulk data transfer tool Sqoop and the distributed log mover Flume.

The Enterprise Edition throws in a number of essential add-ons for corporate customers. There is support for the Spring framework and a number of projects in its ecosystem, such as processing framework Spring Batch and Spring for Apache Hadoop, which simplifies Hadoop application development for users of the enterprise Java framework. Also included is Project Serengeti, an initiative started by VMware last year to make virtualisation environments “Hadoop-aware”.

Pivotal HD’s ace in the hole, however, is HAWQ, the self-proclaimed “world’s fastest” SQL query engine on Hadoop, crafted from a decade of experience with Greenplum databases. The real-time parallel query engine replaces Hive in the enterprise version and boasts up to 600x performance improvements for a number of query types and workloads.


Though competitors Cloudera and MapR have their
own efforts, the invaluable SQL experience attained from Greenplum,
which has been applied to the Hadoop world, could prove to be
decisive, especially as Cloudera’s Impala is comparatively
young.

The Enterprise Edition also includes Command Center, a graphical interface for managing clusters and monitoring jobs; Data Loader, a UI-based ingestion tool that supports either bulk or batch loading; and Unified Storage Service. The latter could prove critical to winning over Hadoop newcomers, as it is essentially an abstraction layer for accessing storage systems outside of HDFS, such as NFS, and making them sing to Hadoop’s tune.

Pivotal have also announced the availability of Pivotal HD Single Node, a VM that contains the raw Pivotal HD and HAWQ components and tutorials for users to test drive. The Community and Enterprise Editions are both available to download right now.

Images courtesy of left-hand & Pivotal
