Digging Deeper

Big Data digger Drill project appears in Apache Incubator - Hadoop killer?

The enterprise need to work with large-scale datasets has grown over the past few years, and dealing with phenomenally large volumes of data is now the norm. But as this demand has increased, so has the demand for analysis that is as near to real-time as possible while still being comprehensive.

Currently, the standard tools, such as the batch-processing Hadoop and the stream-processing Storm, don’t allow this, although the groundwork has been laid (via the Apache Hadoop 2.0 roadmap) to make the process more intuitive and provide deep analytic reporting at the snap of the fingers.

Over the past two years, the Apache Software Foundation has assumed the caretaker role at the centre of Big Data evolution, and a new project has entered the Apache Incubator seeking to push the boundaries further for data-intensive operations.

Inspired by Google’s internal interactive system Dremel, Apache Drill will be a distributed system that scales across 10,000 servers and processes petabytes of data from trillions of rows in seconds. The aim is to shift away from established inefficient methods of processing to one that is flexible and can process nested data without too much heavy lifting.
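
To make the nested-data point concrete, here is a minimal, purely illustrative Python sketch - the record and field names are invented, and this is not Drill code - of a JSON document with a repeated inner field, together with the flattening step that conventional relational tooling would force on it before any query could run:

```python
# Purely illustrative: a made-up nested record of the kind Drill aims to query
# in place, plus the flattening a traditional relational pipeline would require.
import json

order = {
    "order_id": 1001,
    "customer": {"id": 42, "country": "DE"},
    "items": [  # a repeated, nested field
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# The "heavy lifting": exploding the nested structure into flat rows.
flat_rows = [
    {
        "order_id": order["order_id"],
        "customer_id": order["customer"]["id"],
        "sku": item["sku"],
        "qty": item["qty"],
        "price": item["price"],
    }
    for item in order["items"]
]

print(json.dumps(flat_rows, indent=2))
```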

Apache Drill comes from MapR Technologies - a Hadoop vendor that differs from the competition in that, rather than focusing specifically on making the Apache Hadoop codebase as strong as it can be, MapR chose to develop its own advanced flavour and make that available commercially. It’s not surprising to see the company wanting to push this project at an open source level, gaining the approval of its supposed competitors and welcoming them to become a part of it. After all, the key players want the ecosystem to be as healthy as possible.

The proposal notes the data-exploration problems Apache Hadoop faces in analysing data at a sub-second level, since it opts for high throughput first. Pushing Drill to the largest open source haven brings it closer to becoming the de facto standard for data digging - and the only way to do that is to tap into the entire community. There’s currently no better environment for that than Apache.

Like Dremel, Drill doesn’t intend to replace MapReduce, the processing method in Hadoop that is still used to complement Dremel at Google; in fact, it is expected to work in conjunction with the already established technique. The committers behind Drill also realise that something of Dremel’s scale, pulling together such an array of query languages and data formats, has yet to be attempted at an open source level. Consequently, it could spend a fair bit of time within the Apache Incubator before its first release.

Apache Drill is split into four key components, and the team say that implementing all four layers will form the bulk of the next phase of work. They are:

  • Query languages - responsible for parsing queries and constructing the execution plan. Initially this will support Dremel’s SQL-like query language, as used by Google BigQuery, but further along expect support for the likes of MongoDB’s query language and Cascading (a sketch of the intended query semantics follows this list).
  • An execution engine, providing the necessary scalability and fault tolerance needed to query the vast amount of data.
  • Support for nested data formats such as JSON and BSON, as well as flat formats such as CSV, amongst others.
  • Support for scalable data sources, with an initial plan to leverage Hadoop.
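
Drill’s query layer exists only on paper at this stage, so the following Python sketch is not its API; over invented records, it simply mimics the result a Dremel/BigQuery-style aggregation (think SELECT customer.country, SUM(items.qty) FROM orders GROUP BY customer.country) would be expected to produce directly on nested data:

```python
# Illustrative only - not Drill's API. It mimics the semantics of a
# Dremel/BigQuery-style aggregation over nested records.
from collections import defaultdict

orders = [
    {"customer": {"country": "DE"}, "items": [{"qty": 2}, {"qty": 1}]},
    {"customer": {"country": "US"}, "items": [{"qty": 5}]},
    {"customer": {"country": "DE"}, "items": [{"qty": 3}]},
]

qty_by_country = defaultdict(int)
for order in orders:
    country = order["customer"]["country"]
    for item in order["items"]:      # traverse the repeated field directly
        qty_by_country[country] += item["qty"]

print(dict(qty_by_country))          # {'DE': 6, 'US': 5}
```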

Whilst MapR will undoubtedly lead the project (through the expertise of Hadoop veteran Ted Dunning and execution engine specialist Tomer Shiran), other companies such as Drawn to Scale and Concurrent will sit alongside as early committers. The design documents will be housed within MapR repositories.

It’s a very bold proposition, but a necessary one if the world of Big Data is to progress in line with the demands of the enterprises that support the community. The proposal is still in its infancy, so it could be some time before we see a defined architecture, but the early signs are promising. The initial proposal slides provide more detail on Drill itself.

The potential community interested in something like this is huge, linking into other Big Data focused projects. It will also be interesting to see Drill grow to include such projects and whether or not it will adopt an inclusive policy, or choose only the cream of the crop. Hadoop, Avro, Hive and HBase have already been touted as close bedfellows - now we wait to see who else will jump onboard...

Image courtesy of cliff1066™

Chris Mayer

