The elephant is welcome in the room

MongoDB boosts Hadoop capabilities with Apache Hive upgrade

Chris Mayer

The two darlings of Big Data have moved even closer together after 10gen added a raft of improvements to its Hadoop connector for MongoDB.

10gen, the company behind NoSQL document datastore MongoDB, has upgraded its Hadoop Connector, allowing users to shift data between the two Big Data technologies more easily.

With Hadoop still the talk of the industry, the database proprietor has seen the importance of keeping the yellow elephant as a close bedfellow. Though the database is fairly established as the NoSQL leader, the challenge now is to add extra functionality to make it an all-singing, all-dancing data analyser. Making MongoDB capable of processing data without the need to move reams of it across to Apache Hadoop should make the database a huge draw.

The MongoDB Connector for Hadoop calculates splits within a collection and assigns each split to a node in a Hadoop cluster. The nodes then pull their data from MongoDB in parallel and process it locally. The results are merged and the output is streamed back to MongoDB or written out as BSON.
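The split, process and merge flow described above can be pictured with a small sketch. This is plain Python over an in-memory list, not the connector's actual API: the function names `calculate_splits`, `process_split` and `merge`, the `tag` field and the split size are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def calculate_splits(documents, split_size):
    """Cut a collection into contiguous chunks, one per worker node (hypothetical helper)."""
    return [documents[i:i + split_size] for i in range(0, len(documents), split_size)]


def process_split(split):
    """Stand-in for the work a node does locally: count documents per tag."""
    return Counter(doc["tag"] for doc in split)


def merge(partials):
    """Combine the per-node partial counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# A made-up collection of ten documents, used purely for illustration.
collection = [{"_id": i, "tag": "even" if i % 2 == 0 else "odd"} for i in range(10)]

splits = calculate_splits(collection, split_size=4)
result = merge(process_split(split) for split in splits)
print(len(splits), result)
```

In a real cluster each split would be handled by a separate node reading directly from MongoDB; here the splits are simply processed one after another so the merge step is easy to follow.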

Tuesday’s update adds support for MongoDB’s native BSON (binary JSON) backup files within the connector. These files can either be stored in HDFS, Hadoop’s storage component, to be processed within the big data framework, or pushed out to local or cloud-based file systems, an option made practical by JSON’s ubiquity across the industry.

In what is the first major upgrade to the connector since its arrival in April 2012, the connector now enables Apache Hive, Hadoop’s data warehousing framework, to access BSON files in order to run SQL-like queries across data sets. As MongoDB is normally queried using a proprietary language, this clearly represented a huge technical challenge. Despite being the data interface standard, JSON isn’t compatible with Hadoop either. As such, the engineering team behind the connector says that full support for MongoDB collections will only be available in the next release, later in the year, likely to allow time to iron out the kinks.

The connector already supports a number of Hadoop-related projects such as MapReduce, Pig, Hadoop Streaming (at least with node.js, Python and Ruby) and Flume.

Another new feature of MongoDB Connector for Hadoop 1.1 is the option to run incremental MapReduce jobs, meaning users can aggregate trends on a daily basis by updating an existing MongoDB collection rather than rebuilding it. With Hadoop gaining more and more admirers, it is essential that modern databases, like MongoDB, work well with the data processing technology. Otherwise, customers may take their business to a database that is friendlier with the yellow elephant.
