Let’s be specific

What is Big Data anyway?

Cory Isaacson

In the second part of his series, “Scaling for Big Data”, Cory Isaacson gets into the nuts and bolts of Big Data and attempts to pin down a concrete definition.

We all know there is a tremendous focus on Big Data today, but what exactly is Big Data anyway?

Many people make the mistake of thinking Big Data is only about advanced analytics, often of unstructured data. From my experience, this is a very limited definition, and the scope of Big Data will extend to virtually any advanced application with a high-growth requirement. It’s a fact that databases only get larger with time, which will push more and more applications into Big Data territory.

Here is a trimmed down version of what Wikipedia defines as Big Data:

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation [the preservation and maintenance of digital assets], storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”…

Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. What is considered “big data” varies depending on the capabilities of the organization managing the set. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”

As you can see, the focus of the definition is on analytics and trends, but from experience I believe Big Data concepts are much broader, and can be extended to many different database scenarios, including OLTP (online transaction processing), traditional data warehouse applications, and NoSQL engines.

Here is a more practical definition of Big Data:

A monolithic database meets the criteria for Big Data when you have a scalability and performance problem with your database.

It doesn’t matter what technology you use, what type of DBMS engine, or really how big your database is (the Wikipedia definition indicates this as well; it just depends on the application requirement). When you run into performance barriers as your database grows, in my opinion that qualifies as a genuine Big Data problem, one that you should solve with one of the available scalable database technologies.

You should also note that database growth can be in terms of transaction volume and/or database size, as one or both can cause a monolithic database to become overloaded and slow down dramatically.

While it’s well known that databases slow down as they grow in size, what isn’t always obvious is how quickly you can run into a performance problem. A database doesn’t typically slow down in a nice, linear fashion; performance usually follows an exponential “hockey-stick” type curve.

This can affect both database reads and writes. Write transactions represent the most common bottleneck we see with fast-growing, high-volume databases, and are often the driver for moving to a scalable database architecture. However, reads can also experience similar performance degradation, particularly with a high volume of users and complex queries.

To show how writes are affected, here is an example of an actual load test of a single, large, complex table using MySQL configured with the InnoDB engine:

This was a heavily used table from a customer database, with over 100 columns and 12 indexes, to accommodate a large range of analytics queries. While the size of the table and its number of indexes are extreme (more than most applications would require), it does clearly illustrate how dramatic database slow-down can be.

As you can see, during this test we could load approximately 1GB in 1 minute and 3.5GB in about 12 minutes, reasonable times for our purposes. Then we tried to load the entire 39GB of data, and after 10 days we finally gave up; in fact, we never could load it successfully without special handling.
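To make the shape of that curve concrete, here is a small Python sketch that turns the three figures quoted above into average load rates. It only does arithmetic on the numbers from the test; the variable and function names are my own, and 1GB is taken as 1,000MB for simplicity. Because the 39GB load was abandoned rather than completed, its rate should be read as an upper bound.

```python
# Load-test figures quoted in the article: (label, gigabytes, elapsed minutes).
# The 39 GB run was abandoned after 10 days, so its rate is only an upper bound.
MINUTES_PER_DAY = 24 * 60

load_runs = [
    ("1 GB", 1.0, 1.0),
    ("3.5 GB", 3.5, 12.0),
    ("39 GB (abandoned)", 39.0, 10 * MINUTES_PER_DAY),
]


def throughput_mb_per_min(gigabytes, minutes):
    """Average load rate in MB per minute (assuming 1 GB = 1000 MB)."""
    return gigabytes * 1000.0 / minutes


rates = [(label, throughput_mb_per_min(gb, mins)) for label, gb, mins in load_runs]

# Compare every run against the small 1 GB load.
baseline = rates[0][1]
for label, rate in rates:
    slowdown = baseline / rate
    print(f"{label:>18}: {rate:8.1f} MB/min ({slowdown:6.1f}x slower than the 1 GB run)")
```

The data volume grows by a factor of 39 between the first and last run, while the average load rate drops by a factor of several hundred, which is the hockey-stick behaviour described earlier.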

Even with simpler, more traditional table structures, this same type of database slow-down can occur as a table grows, with the same shape of curve.

The important point to note is that when your database runs into this type of Big Data problem, you may not have a lot of time to react.

Early in my career we had a major production database that was working perfectly for months, with about 100 concurrent users. Then one day (literally overnight), some queries that were sub-millisecond the day before began to take several seconds to execute, slowing down the entire application to unacceptable levels. Needless to say, it was a very undesirable experience, one we had to react to very quickly. In that particular case we were lucky: adding a simple index to support the offending query did the trick. However, it’s rarely that simple to fix an issue like this, especially when you have extreme workloads and data growth.
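The details of that incident are not recorded here, so the following is only an illustration of the general idea, not a reconstruction. It uses Python’s built-in sqlite3 module as a stand-in database, and the orders table, its columns and the index name are all hypothetical. The sketch asks the engine for its query plan before and after adding an index on the column used in the WHERE clause.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column is the readable detail


# Hypothetical table, populated with synthetic rows purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(50_000)],
)

lookup = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index the engine has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of the table and the second a search using idx_orders_customer. A full scan costs more with every row added, which is one way a query can go from fast to painfully slow as a table grows. Other engines, MySQL included, offer their own EXPLAIN facilities for the same kind of check.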

The important point is this: Many applications today are experiencing Big Data challenges. The problem is becoming more prevalent all the time, as the nature of applications changes to take advantage of the current data boom.

Therefore it’s important to understand these 3 things:

  • The cause of Big Data performance issues;

  • What you can do about these issues in the short-term;

  • Available technologies that can be used to overcome Big Data challenges for databases of virtually any size.

Next month I’ll be covering the common reasons for Big Data performance degradation, and how to recognize the symptoms as early as possible so that you can respond to these issues proactively.

Cory Isaacson
Cory Isaacson is CEO/CTO of CodeFutures Corporation, maker of dbShards, a leading database scalability suite providing a true “shared nothing” architecture for relational databases. Cory has authored numerous articles in a variety of publications including SOA Magazine and Database Trends and Applications, and recently authored the book Software Pipelines and SOA (Addison-Wesley). Cory has more than twenty years of experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory can be reached at: [email protected]
