Eyes on HAWQ
Pivotal ship Hadoop distro complete with world’s fastest SQL query engine
Pivotal, the spinoff of VMware and EMC technologies, has shipped its first product: a Hadoop distribution claiming to be 100% Apache-based.
Back in February, the covers came off Pivotal HD (standing for Hadoop distribution) to little fanfare, just after the creation of the ‘next generation’ company formed of big data and cloud technologies. After rigorous testing at the 1,000-node Pivotal Analytics Workbench, we now know more about the stack.
Pivotal HD 1.0 is built on the Hadoop 2.0.2 codebase, meaning it includes both the original MapReduce processing engine and the highly anticipated YARN resource management framework, touted for the next major Hadoop release.
YARN (Yet Another Resource Negotiator) is the community’s attempt to overcome the shortcomings of Hadoop’s batch processing model, transforming it into the near real-time data platform that clients want. The new framework can handle and schedule resource requests from multiple applications while also overseeing their execution.
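To make the idea of one scheduler arbitrating between several applications concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not YARN's actual API: the class `ToyResourceManager`, its `request` and `release` methods, the application names and the memory figures are all hypothetical, and a simple first-in, first-out policy stands in for whatever scheduling a real cluster would use.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster-wide scheduler (hypothetical, not the YARN API)."""
    total_memory_mb: int
    used_memory_mb: int = 0
    pending: deque = field(default_factory=deque)  # queued (app, memory_mb) requests
    running: dict = field(default_factory=dict)    # app -> list of granted sizes

    def request(self, app: str, memory_mb: int) -> None:
        """Queue a container request from an application, then try to schedule."""
        self.pending.append((app, memory_mb))
        self._schedule()

    def release(self, app: str) -> None:
        """Free everything an application holds, then schedule waiting requests."""
        self.used_memory_mb -= sum(self.running.pop(app, []))
        self._schedule()

    def _schedule(self) -> None:
        """Grant queued requests in FIFO order while capacity remains."""
        while self.pending:
            app, memory_mb = self.pending[0]
            if self.used_memory_mb + memory_mb > self.total_memory_mb:
                break  # head of the queue does not fit yet; wait for a release
            self.pending.popleft()
            self.used_memory_mb += memory_mb
            self.running.setdefault(app, []).append(memory_mb)


rm = ToyResourceManager(total_memory_mb=8192)
rm.request("mapreduce-job", 4096)
rm.request("sql-query", 2048)
rm.request("stream-app", 4096)  # does not fit yet, so it stays queued
rm.release("mapreduce-job")     # frees 4096 MB, so stream-app is now granted
```

The point of the sketch is only that batch jobs and interactive queries draw on the same pool of resources through one arbiter, rather than each workload needing its own cluster.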
Including both the tried and tested mechanism that helped Hadoop become an enterprise favourite and the foundation of Hadoop 2.0 should help generate interest. Whether it was wise to jump ahead of competitors, though, is up for debate. Hortonworks (leading YARN development) don’t include it in their latest stable distribution, but do include it in the community preview of Hortonworks Data Platform 2.0.
The Community Edition of the Pivotal HD stack is composed entirely of Apache Hadoop stablemates. Aside from the standard processing and storage components, MapReduce and HDFS, the distribution also contains fellow big data projects such as data warehouse Hive, the machine learning-focused Mahout, data-flow scripting platform Pig and NoSQL datastore HBase. Completing the set are distributed coordination service ZooKeeper, speedy bulk data transfer tool Sqoop and the distributed log mover Flume.
The Enterprise Edition throws in a number of essential add-ons for corporate customers. There is support for the Spring framework and a number of projects in its ecosystem, such as processing framework Spring Batch and Spring for Apache Hadoop, which simplifies Hadoop application development for users of the enterprise Java framework. Also included is Project Serengeti, an initiative started by VMware last year to make virtualisation environments “Hadoop-aware”.
Pivotal HD’s ace in the hole, however, is HAWQ, the self-proclaimed “world’s fastest” SQL query engine on Hadoop, crafted from a decade of experience with Greenplum databases. The real-time parallel query engine replaces Hive in the enterprise version and boasts performance improvements of up to 600x for a number of query types and workloads.
Though competitors Cloudera and MapR have their own efforts, the invaluable SQL experience attained from Greenplum, which has been applied to the Hadoop world, could prove to be decisive, especially as Cloudera’s Impala is comparatively young.
The Enterprise Edition also includes Command Center, a graphical interface for managing clusters and monitoring jobs; Data Loader, a UI-based ingestion tool that supports either bulk or batch loading; and Unified Storage Service. The last of these could prove critical to winning over Hadoop newcomers, as it is essentially an abstraction layer for accessing storage systems outside of HDFS, such as NFS, and making them sing to Hadoop’s tune.
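For readers wondering what a storage abstraction layer of this kind looks like in practice, the following Python sketch shows the general pattern. It is purely illustrative and does not reflect how Unified Storage Service is implemented: `StorageBackend`, `InMemoryBackend`, `UnifiedStorage`, the `mount` method and the example URIs are all hypothetical, and in-memory dictionaries stand in for real HDFS and NFS systems so that the example runs on its own.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Common interface every storage system must expose (hypothetical)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the contents stored at the given path."""


class InMemoryBackend(StorageBackend):
    """Stand-in for a real system such as HDFS or an NFS mount."""

    def __init__(self, files: dict):
        self._files = dict(files)

    def read(self, path: str) -> bytes:
        return self._files[path]


class UnifiedStorage:
    """Routes each URI to the backend registered for its scheme."""

    def __init__(self):
        self._backends = {}

    def mount(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def read(self, uri: str) -> bytes:
        parsed = urlparse(uri)
        if parsed.scheme not in self._backends:
            raise ValueError(f"no backend mounted for scheme {parsed.scheme!r}")
        return self._backends[parsed.scheme].read(parsed.path)


storage = UnifiedStorage()
storage.mount("hdfs", InMemoryBackend({"/data/part-0": b"rows from hdfs"}))
storage.mount("nfs", InMemoryBackend({"/exports/logs.txt": b"lines from nfs"}))

# Callers use one interface regardless of where the bytes actually live.
hdfs_bytes = storage.read("hdfs://namenode/data/part-0")
nfs_bytes = storage.read("nfs://filer/exports/logs.txt")
```

The appeal for newcomers is that jobs address data by location without needing to copy it into HDFS first.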
Pivotal have also announced the availability of Pivotal HD Single Node, a VM that contains the raw Pivotal HD and HAWQ components and tutorials for users to test drive. The Community and Enterprise Editions are both available to download right now.
Images courtesy of left-hand & Pivotal