No sacrifices necessary
Metamarkets open sources distributed database Druid
It's no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, which provides “Data Science-as-a-Service” business analytics, last year revealed details of its in-house distributed database, Druid, and this week released it as an open source project.
Druid was designed to answer multi-dimensional queries on data as it arrives. The company originally experimented with both relational and NoSQL databases, but concluded that neither was fast enough for its needs, and so built its own.
The company claims that Druid can scan “33M rows per second per core” and ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company reached scan speeds of 26B records per second through horizontal scaling. Druid gets this performance from a distributed architecture, column orientation and bitmap indices.
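To give a feel for why column orientation and bitmap indices help with multi-dimensional queries, here is a minimal sketch in Python. It is not Druid's code: the data, the column names and the helper functions (`build_bitmap_index`, `rows_of`, `sum_clicks`) are all hypothetical, and plain Python integers stand in for the bitmaps a real system would store in compressed form.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Yield the row numbers whose bits are set in the bitmap."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


# Hypothetical event data, stored as one list per column rather than per row.
columns = {
    "country": ["US", "UK", "US", "DE", "US"],
    "device": ["mobile", "mobile", "desktop", "mobile", "mobile"],
    "clicks": [3, 1, 4, 1, 5],
}

# Index only the dimension columns; the metric column is read directly.
indexes = {name: build_bitmap_index(columns[name]) for name in ("country", "device")}


def sum_clicks(country, device):
    """Sum clicks for rows matching both filters using a bitwise AND of bitmaps."""
    matching = indexes["country"].get(country, 0) & indexes["device"].get(device, 0)
    return sum(columns["clicks"][row] for row in rows_of(matching))


print(sum_clicks("US", "mobile"))
```

The filter on two dimensions becomes a single bitwise AND, and only the `clicks` column is read for the matching rows, so the query never touches columns it does not need.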
GigaOM reports that CEO Mike Driscoll said he chose to open source the technology, which powers the company's web-based data analysis software, in an attempt to make it an industry standard. As with Twitter's open-sourcing of certain technologies, the move carries no competitive downside, since the data analysis service, rather than Druid itself, is the company's key product.
The release comes in the same week that Cloudera unveiled Impala, a real-time SQL-based query engine for Hadoop - yet another tool for instant data analysis.
As with seemingly all modern open source projects, you can find the source code on GitHub at github.com/metamx/druid.