Let’s be specific

What is Big Data anyway?

Cory Isaacson

In the second part of his series, “Scaling for Big Data”, Cory Isaacson gets into the nuts and bolts of Big Data and attempts to pin down a concrete definition.

We
all know there is a tremendous focus on Big Data today, but what exactly
is Big Data anyway?

Many people make the mistake of thinking Big
Data is only about advanced analytics, often of unstructured data.
In my experience this is a very limited definition; the scope of
Big Data reaches virtually any advanced application with a
high-growth requirement. It’s a fact that databases only get larger
with time, which will push more and more applications into the Big
Data category.

Here is a trimmed-down version of Wikipedia’s
definition of Big Data:

In information technology, big data is a
collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools. The
challenges include capture, curation [the preservation
and maintenance of digital assets], storage, search,
sharing, analysis, and visualization. The trend to larger data sets
is due to the additional information derivable from analysis of a
single large set of related data, as compared to separate smaller
sets with the same total amount of data, allowing correlations to
be found to “spot business trends, determine quality of research,
prevent diseases, link legal citations, combat crime, and determine
real-time roadway traffic conditions.”…

Big data is difficult to work with using
relational databases and desktop statistics and visualization
packages, requiring instead “massively parallel software running on
tens, hundreds, or even thousands of servers”. What is considered
“big data” varies depending on the capabilities of the organization
managing the set. “For some organizations, facing hundreds of
gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may take tens or
hundreds of terabytes before data size becomes a significant
consideration.”

http://en.wikipedia.org/wiki/Big_data

As you can see, the focus of the definition is
on analytics and trends, but from experience I believe Big Data
concepts are much broader, and can be extended to many different
database scenarios, including OLTP (online transaction processing),
traditional data warehouse applications, and NoSQL
engines.

Here is a more practical definition of Big
Data:

A monolithic database meets the criteria for Big
Data when you have a scalability and
performance problem with your
database.

It doesn’t matter what technology you use, what
type of DBMS engine, or really how big your database is (the
Wikipedia definition indicates this as well; it depends on the
application requirements). When you run into performance barriers as
your database grows, in my opinion that qualifies as a genuine Big
Data problem, one that you should solve with one scalable database
technology or another to overcome the
limitation.

You should also note that database growth can be
in terms of transaction volume and/or database size, as one or both
can cause a monolithic database to become overloaded and slow down
dramatically.

While it’s a well-known fact that databases slow
down as they grow in size, what isn’t always obvious is how quickly
you can run into a performance problem. A database doesn’t typically
slow down in a nice, linear fashion; performance usually degrades
along an exponential, “hockey-stick” type curve.

This can affect both database reads and writes.
Write transactions represent the most common bottleneck we see with
fast-growing, high-volume databases, and are often the driver for
moving to a scalable database architecture. However, reads can also
experience similar performance degradation, particularly with a
high volume of users and complex queries.

To show how writes are affected, here is an
example of an actual load test of a single, large, complex table
using MySQL configured with the InnoDB engine.

This was a heavily used table from a customer
database, with over 100 columns and 12 indexes to accommodate a
large range of analytics queries. While the size and number of
indexes on this table are extreme (more than most applications would
require), it clearly illustrates how dramatically a database can
slow down.

During this test we could load approximately 1GB
in 1 minute and 3.5GB in about 12 minutes, reasonable times for our
purposes. Then we tried to load the entire 39GB of data, and after
10 days we finally gave up; in fact, we never could load it
successfully without special
handling.
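
To make that kind of test concrete, here is a minimal sketch of how you might script a similar bulk-load timing run yourself. It is an illustration only: it assumes a local MySQL instance, the mysql-connector-python driver, and a hypothetical two-column InnoDB table named events, whereas the real test used a far larger 100-column, 12-index customer table, so your numbers will certainly differ.

  # Minimal bulk-load timing sketch (hypothetical schema and credentials).
  # Assumes: CREATE TABLE events (id BIGINT PRIMARY KEY, payload VARCHAR(255)) ENGINE=InnoDB
  import time
  import mysql.connector

  BATCH_ROWS = 10000      # rows per INSERT batch (arbitrary choice)
  TOTAL_BATCHES = 100     # raise this to push the table toward the "hockey stick"

  conn = mysql.connector.connect(
      host="localhost", user="loadtest", password="secret", database="bigdata_demo"
  )
  cur = conn.cursor()
  insert_sql = "INSERT INTO events (id, payload) VALUES (%s, %s)"

  row_id = 0
  for batch in range(TOTAL_BATCHES):
      rows = []
      for _ in range(BATCH_ROWS):
          row_id += 1
          rows.append((row_id, "x" * 200))   # ~200-byte dummy payload
      start = time.perf_counter()
      cur.executemany(insert_sql, rows)     # one timed batch insert
      conn.commit()
      elapsed = time.perf_counter() - start
      # Watch how the elapsed time per batch changes as the table and its indexes grow.
      print("batch %d: %d rows in %.2fs" % (batch + 1, BATCH_ROWS, elapsed))

  cur.close()
  conn.close()

The per-batch time is the number to watch: it stays roughly flat while the table is small, and the hockey-stick appears when it starts climbing from one batch to the next.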

Even with simpler, more traditional table
structures, the same type of slow-down can occur as a table grows,
following the same shape of curve.

The important point to note is that when your
database runs into this type of Big Data problem, you may not have
a lot of time to react.

Early in my career we had a major production
database that had been working perfectly for months, with about 100
concurrent users. Then one day (literally overnight), some queries
that had been sub-millisecond the day before began to take several
seconds to execute, slowing the entire application down to
unacceptable levels. Needless to say it was a very undesirable
experience, and one we had to react to very quickly. In that
particular case we were lucky: adding a simple
index to support the offending query did the trick. However, it’s
rarely that simple to fix an issue like this, especially when you
have extreme workloads and data growth.
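
For illustration, here is a hedged sketch of the kind of quick fix described above: use EXPLAIN to confirm the offending query is scanning the whole table, then create a supporting index. The orders table, its columns, and the connection details are hypothetical stand-ins, not details from the original incident.

  # Hypothetical example of diagnosing a slow query and adding a supporting index.
  import mysql.connector

  conn = mysql.connector.connect(
      host="localhost", user="dba", password="secret", database="production_copy"
  )
  cur = conn.cursor()

  slow_query = (
      "SELECT order_id, total FROM orders "
      "WHERE customer_id = %s AND created_at >= %s"
  )
  params = (42, "2012-01-01")

  # 1. Confirm the problem: with no supporting index, EXPLAIN shows a full scan
  #    (type = ALL, key = NULL) over the whole orders table.
  cur.execute("EXPLAIN " + slow_query, params)
  for row in cur.fetchall():
      print(row)

  # 2. Add a composite index that covers the WHERE clause. On a very large
  #    InnoDB table this DDL can itself take a long time to build.
  cur.execute(
      "CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at)"
  )

  # 3. Re-run EXPLAIN; the plan should now use idx_orders_customer_created and
  #    examine only the matching rows.
  cur.execute("EXPLAIN " + slow_query, params)
  for row in cur.fetchall():
      print(row)

  cur.close()
  conn.close()

Note that on a table that has already hit the hockey-stick, even the CREATE INDEX statement can run for hours, which is part of why these problems get harder to fix the longer you wait.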

The important point is this: many applications
today are experiencing Big Data challenges. The problem is becoming
more prevalent all the time, as applications change to take
advantage of the current data
boom.

Therefore it’s important to understand these three
things:

  • The cause of Big Data performance
    issues;

  • What you can do about these issues in the
    short-term;

  • Available technologies that can be used to
    overcome Big Data challenges for databases of virtually any
    size.

Next month I’ll be covering the common reasons
for Big Data performance degradation, and how to recognize the
symptoms as early as possible so that you can proactively respond
to these issues.

Author
Cory Isaacson
Cory Isaacson is CEO/CTO of CodeFutures Corporation, maker of dbShards, a leading database scalability suite providing a true “shared nothing” architecture for relational databases. Cory has authored numerous articles in a variety of publications including SOA Magazine and Database Trends and Applications, and recently authored the book Software Pipelines and SOA (Addison Wesley). Cory has more than twenty years’ experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory can be reached at: cory.isaacson@codefutures.com