Scaling for Big Data

An introduction to Column Databases

Cory Isaacson

How to successfully integrate Column Databases into Big Data architecture.

In an earlier column, I reviewed some of the basic the types of DBMS engines. Well, in this ever-expanding, rapidly changing field of Big Data there are more types that deserve to be part of the discussion.

In this article, I will cover Column Databases, a great solution for data warehouse type applications.

What is a Column Database?

It is easiest to understand how a Column Database works by comparing it to traditional RDMBS engines. As you know, traditional RDBMS engines store data in a row-based fashion, like this:

This structure works very well for transactional applications, all data for each unique row is store together. You can see that it is very convenient for adding, updating or retrieving a single customer row at a time, like this:

INSERT INTO customer (customer_id, customer_name, start_date, type, zip_code)
(3, ‘Unlimited Enterprises, Inc.’, ‘5/1/2013’, ‘Economy, 33333);
UPDATE customer 
SET zip_code = 44444
WHERE customter_id = 2;
FROM customer
WHERE customer_id = 3;

This type of thing works fine, performs well and will go as fast as the RDBMS can process single row reads and writes. But what happens in the analytic case, when you need to scan lots of rows to find specific information or summarize data?

Here is an example query that can be very expensive with a row-based structure:

FROM customer
WHERE start_date BETWEEN ‘1/1/2013’ AND ‘3/31/2013’
GROUP BY type;

This type of query often causes a table scan, unless you add an index on the key field (start_date in this case). Even then it can still take a lot of processing, because a lot of data in the row-based structure must be scanned and read. To make matters worse, what happens when you have a table with many columns and you run complex queries across several column values? It’s not practical to index them all as it slows down writes, plus the RDBMS can only effectively use one or two indexes to limit the number of rows to scan. This is all very important when it comes to Big Data scale for analytics, tables with tens of millions to billions of rows.

Introducing the Column Database

A Column Database is designed to address this exact situation. As you probably have guessed, a Column Database stores data by column, instead of by row. The data is stored separately for each column, with pointers back to the logical row that the value belongs too.

Here is what a column database structure might look like:

Why does this work better for analytic queries? There are many factors, and some depend on the specific implementation of the column database engine itself. Here are some of the basic points that are important:

A column database can answer a query (like the COUNT/GROUP BY query above) very efficiently. The way it works is it can scan a single column to find the values it needs, determine what rows match that query, and then further filter by other column values.  If you consider it this way, a Column Database is really a set of indexes, one for each column.

Column data can be compressed using various algorithms, making the data much smaller. This is particularly true based on the cardinality of the column (i.e., the number if unique values in a single column). Low-cardinality columns (those with a small number of unique values) can compress extremely well, and be searched very quickly. High-cardinality columns (those with many unique values) will take longer to search, but still may be compressed, and because the scan is of a single column results can be produced much faster.

Let’s look at a slightly more complex query. Note: I am using SQL for the example queries, but not all column databases use the SQL language, some have other languages or query syntax.

SELECT type, zip_code, COUNT(*)
FROM customer
WHERE start_date BETWEEN ‘1/1/2013’ AND ‘3/31/2013’
GROUP BY type, zip_code;
In this query, we now have two columns we are grouping by: type and zip_code. A Column Database can perform this very effectively, first scanning one of the columns (let’s say zip_code), then taking that list of row numbers and filtering again by the second column (type). The results then must be sorted, again an easier operation because the data values are evaluated independently.

Thus, the result can be returned extremely quickly, even with a large number of rows. In actual practice, we have seen Column Databases out-perform a traditional RDBMS by up to 100:1, they can be really fast for the right types of operations.

Things to Consider about Column Databases

Because of the way Column Databases are structured, you cannot work with them in the same way as your traditional RDBMS. Here are some of the important points:

Data normally should be bulk-loaded in big chunks of rows (typically 1000 or more rows in a chunk). A Column Database must do a lot of work to build the column structure when new data is added, including compressing column information.  You should not use it in a transactional manner, adding one row at a time. Therefore, Column Databases work best for the analytics use case, like a data warehouse of historical, non-changing data.

Extracting your data can be slow. If you need to dump data out of a column database, depending on the particular engine implementation, the process can be far slower than what you would expect. The reason is that the column database must uncompress and reassemble its column structure back into a row-based format.

Column Database vendors support Materialized Views, which are pre-summarized or pre-computed complex queries stored for the next use. These can be very effective if you have a lot of similar or identical queries in your use case.

Column Databases need a lot of memory. This is due to the fact that they work best when data for a given column can fit in RAM. Therefore, the hardware requirements may be different that your traditional database.

Wrapping it up

In this article, I have provided a basic tour and description of how Column Databases work. These DBMS engines are a powerful weapon on the Big Data arsenal, especially for heavy analytics use cases. Integrating a Column Database into your overall Big Data architecture can provide a very successful strategy for meeting this type of requirement.

In future articles I will continue to review various types of DBMS engines, and then begin to cover some specific vendors and their capabilities.

Row of columns photo via Shutterstock
Cory Isaacson
Cory Isaacson is CEO/CTO of CodeFutures Corporation, maker of dbShards, a leading database scalability suite providing a true “shared nothing” architecture for relational databases. Cory has authored numerous articles in a variety of publications including SOA Magazine, Database Trends and Applications, and recently authored the book Software Pipelines and SOA (Addison Wesley). Cory has more than twenty years experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory can be reached at: [email protected]

Inline Feedbacks
View all comments