Metamarkets open sources distributed database Druid
Data Science-as-a-Service company offer their speedy custom database technology to the world.
It’s no secret that the latest challenge for the ‘big data’ movement is
moving from batch processing to real-time analysis. Metamarkets,
who provide “Data Science-as-a-Service” business analytics,
last year revealed details of in-house distributed database
Druid – and have this week released
it as an open source project.
Druid was designed to support multi-dimensional queries on data as
and when it arrives. The company originally experimented with both
relational and NoSQL databases, but concluded they were not fast
enough for their needs and so built their own.
The company claims that Druid’s scan speed is “33M rows per second
per core”, and that it can ingest “up to 10K incoming records per
second per node”. An earlier blog post
outlines how the company managed to achieve scan speeds of 26B
records per second using horizontal scaling. Druid does this via a
distributed architecture, column orientation and bitmap indices.
CEO Mike Driscoll reportedly said he chose to open source the
technology, which powers their web-based data analysis software, in
an attempt to make it an industry standard. As with Twitter’s
open-sourcing of certain technologies, it offers no competitive downside
– since the data analysis service, rather than Druid itself, is
their key product.
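To give a rough feel for how column orientation and bitmap indexing can speed up multi-dimensional queries, here is a minimal Python sketch. It is not Druid's code: the `ColumnStore` class, its method names and the sample data are all hypothetical, and a Python integer stands in for the compressed bitmaps a real system would likely use. Each dimension value gets a bitmap marking the rows that contain it, so a filter on several dimensions becomes a bitwise AND, and only the metric column is read for the matching rows.

```python
from collections import defaultdict


class ColumnStore:
    """Toy column-oriented table with one bitmap per dimension value.

    Hypothetical illustration only; not Druid's actual data structures.
    """

    def __init__(self, dimensions, metric):
        self.dimensions = dimensions
        self.metric = metric
        self.metric_column = []  # metric values stored together, one per row
        # index[dim][value] is an int used as a bitmap: bit i set => row i matches
        self.index = {d: defaultdict(int) for d in dimensions}

    def ingest(self, row):
        """Append one record and update the bitmap of each dimension value."""
        row_id = len(self.metric_column)
        self.metric_column.append(row[self.metric])
        for d in self.dimensions:
            self.index[d][row[d]] |= 1 << row_id

    def sum_where(self, **filters):
        """Sum the metric over rows matching every dimension=value filter."""
        bitmap = (1 << len(self.metric_column)) - 1  # start with all rows
        for dim, value in filters.items():
            bitmap &= self.index[dim].get(value, 0)  # AND narrows the row set
        total, row_id = 0, 0
        while bitmap:
            if bitmap & 1:
                total += self.metric_column[row_id]
            bitmap >>= 1
            row_id += 1
        return total


if __name__ == "__main__":
    store = ColumnStore(["country", "browser"], "clicks")
    store.ingest({"country": "US", "browser": "chrome", "clicks": 3})
    store.ingest({"country": "US", "browser": "firefox", "clicks": 5})
    store.ingest({"country": "DE", "browser": "chrome", "clicks": 7})
    store.ingest({"country": "US", "browser": "chrome", "clicks": 11})
    print(store.sum_where(country="US", browser="chrome"))  # 14
```

Because records are indexed as they are ingested, a query can run against data that has only just arrived, which is the behaviour the article describes. A distributed system would presumably spread such structures across many nodes and merge the partial results.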
The release comes in the same week that Cloudera unveiled
Impala, a real-time SQL-based query engine for Hadoop – yet
another tool for instant data analysis.
As with seemingly all modern open source projects, you can find the
source code on GitHub at github.com/metamx/druid.