The elephant is welcome in the room

MongoDB boosts Hadoop capabilities with Apache Hive upgrade

Chris Mayer

The two darlings of Big Data have moved even closer together after 10gen added a raft of improvements to its Hadoop connector for MongoDB.

10gen, the company behind the NoSQL document datastore MongoDB, has upgraded its Hadoop Connector, allowing users to shift data between the two Big Data technologies more easily.

With Hadoop still the talk of the industry, the database vendor has seen the importance of keeping the yellow elephant as a close bedfellow. Though MongoDB is fairly well established as the NoSQL leader, the challenge now is to add extra functionality and turn it into an all-singing, all-dancing data analyser. Making MongoDB data available for Hadoop-style processing, without the need to bulk-move reams of data across to Apache Hadoop first, should make the database a huge draw.

The MongoDB Connector for Hadoop calculates splits within a collection and assigns each split to a node in a Hadoop cluster. The nodes then pull their portion of the data from MongoDB and process it locally. The results are merged, and the output is streamed back to MongoDB or written out as BSON.
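For the curious, a job wired up to the connector looks roughly like the sketch below. The class names and the mongo.input.uri / mongo.output.uri settings reflect the open-source mongo-hadoop project as best we can tell, while the collection URIs and the "status" field are invented for illustration; the job simply counts documents per status value and writes the totals straight back to MongoDB.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class StatusCountJob {

    // Each map task receives the documents of one split, pulled locally
    // from MongoDB by the connector.
    public static class StatusMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object id, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            // "status" is a made-up field used purely for illustration.
            ctx.write(new Text(String.valueOf(doc.get("status"))), ONE);
        }
    }

    // Per-key totals are merged here and streamed back into the output collection.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Source and destination collections (placeholder URIs).
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.status_counts");

        Job job = new Job(conf, "status counts");
        job.setJarByClass(StatusCountJob.class);

        // The connector calculates splits over the collection and hands one to each node...
        job.setInputFormatClass(MongoInputFormat.class);
        // ...and the reduced output is streamed straight back into MongoDB.
        job.setOutputFormatClass(MongoOutputFormat.class);

        job.setMapperClass(StatusMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}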


Tuesday’s update adds support for MongoDB’s native BSON (binary JSON) backup files within the connector. These can either be stored in HDFS, Hadoop’s storage component, to be processed within the big data framework, or pushed out to local or cloud-based file systems, trading on JSON’s ubiquity across the industry.
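Reading those backups from Hadoop appears to be a matter of swapping the input format. A rough sketch, assuming the connector's BSONFileInputFormat class for mongodump .bson files; the HDFS paths and the "user" field are invented, and this map-only job simply pulls one field out of each backed-up document as plain text for downstream tools.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.bson.BSONObject;

import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonBackupExtract {

    // Each record read from the .bson dump is one backed-up document.
    public static class FieldMapper
            extends Mapper<Object, BSONObject, Text, NullWritable> {
        @Override
        protected void map(Object key, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            // "user" is a made-up field used purely for illustration.
            ctx.write(new Text(String.valueOf(doc.get("user"))), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "extract users from mongodump");
        job.setJarByClass(BsonBackupExtract.class);

        // Read the raw backup file straight out of HDFS; no running mongod is needed.
        job.setInputFormatClass(BSONFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///backups/events.bson"));

        job.setMapperClass(FieldMapper.class);
        job.setNumReduceTasks(0); // map-only pass
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Results land in HDFS as plain text for any downstream tool to pick up.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///output/users"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}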

In the first major upgrade to the connector since its arrival in April 2012, it now enables Apache Hive, Hadoop’s data warehousing framework, to access BSON files in order to run SQL-like queries across data sets. As MongoDB is normally queried with its own query language, this clearly represented a huge technical challenge, and JSON, despite being a de facto standard data interchange format, isn’t natively compatible with Hadoop either. As such, the engineering team behind the connector say that full Hive support for MongoDB collections will only be available in the next release, later in the year, most likely to give them time to iron out the kinks.
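In practice, that should mean a mongodump file sitting in HDFS can be declared as an external Hive table and queried with ordinary HiveQL. The sketch below submits the statements over JDBC; the SerDe and input/output format class names are educated guesses at the connector's Hive module rather than confirmed details, and the HiveServer2 address, table layout and HDFS path are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOverBson {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver and a placeholder connection string.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // Declare the .bson dump sitting in HDFS as an external Hive table.
        // The SerDe and format class names are assumptions about the connector's Hive module.
        stmt.execute(
            "CREATE EXTERNAL TABLE IF NOT EXISTS events (username STRING, status STRING) "
          + "ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe' "
          + "STORED AS INPUTFORMAT 'com.mongodb.hadoop.hive.input.HiveBSONFileInputFormat' "
          + "OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat' "
          + "LOCATION 'hdfs:///backups/events/'");

        // Ordinary SQL-like HiveQL now runs over the MongoDB backup.
        ResultSet rs = stmt.executeQuery(
            "SELECT status, COUNT(*) FROM events GROUP BY status");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}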

The connector already supports a number of Hadoop-related projects such as MapReduce, Pig, Hadoop Streaming (at least with Node.js, Python and Ruby) and Flume.

Another new feature of MongoDB Connector for Hadoop 1.1 is the option to run incremental MapReduce jobs, meaning users can aggregate trends on a daily basis by modifying an existing MongoDB collection rather than reprocessing everything from scratch.
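A rough sketch of what such an incremental run might look like, reusing the mapper and reducer from the earlier example (assumed to be on the classpath): the connector's mongo.input.query setting, here with an invented "day" field and placeholder URIs, restricts the job to the latest day's documents, whose totals are then written into an aggregate collection that already exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class DailyTrendJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
        // An existing collection that accumulates the running aggregates.
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.daily_trends");
        // Only read the latest day's documents instead of rescanning the whole collection;
        // "day" is a made-up integer field of the form yyyymmdd.
        conf.set("mongo.input.query", "{\"day\": 20130805}");

        Job job = new Job(conf, "daily trend update");
        job.setJarByClass(DailyTrendJob.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // StatusMapper and SumReducer come from the earlier sketch.
        job.setMapperClass(StatusCountJob.StatusMapper.class);
        job.setReducerClass(StatusCountJob.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}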
With Hadoop gaining more and more admirers, it is essential that modern databases like MongoDB interoperate smoothly with the data processing technology. Otherwise, customers may take their business to a database that is friendlier with the yellow elephant.
