Choose wisely

Evaluating NoSQL performance: Which database is right for your data?

SergeySverchkov
choice

Sergey Sverchkov explains how you can cut through the masses of NoSQL marketing jargon and find the perfect database for your project.

Often referred to as NoSQL, non-relational databases
feature elasticity and scalability. In addition, they can store big
data and work with cloud computing systems. All of these factors
make them extremely popular. In 2013, the number of NoSQL products
reached 150 plus, and the figure is still growing. That variety
makes it difficult to select just one. To make things worse, the
abundance of marketing materials describing NoSQL products makes it
hard to understand whether a particular solution will be useful for
your use case.

To choose the best NoSQL solutions for our customer’s
projects, we at Altoros test them under varying types of workloads.
This article overviews the results of the latest performance tests
of some of the most mature and popular NoSQL data stores on the
market.

NoSQL data stores: The basics

Why did NoSQL data stores appear? Mostly because
relational databases (RDBMS) have a number of restrictions when
working with large datasets. For example, RDBMS are hard to scale
and their architecture is designed to work on a single machine
when:

-Scaling write operations is hard, expensive, or
impossible.

-Vertical scaling (or upgrading equipment) is either
limited or very expensive. Unfortunately, this is often the only
possible way you can scale.

-Horizontal scaling (or adding new nodes to the
cluster) is either unavailable or you can only implement it
partially. There are some solutions from Oracle and Microsoft that
make it possible to have computing instances on several servers.
Still, the database itself remains in shared storage.

In addition to poor scalability, RDBMS have strict
data models. The schema is created together with the database and
you may need significant time and effort to change this structure.
In most cases it is a complex process that most likely involves a
considerable amount of production downtime. Apart from that, RDBMS
have difficulties with semi-structured data.

NoSQL solutions address these and a lot of other
problems. Several types exist – key-value, columnar,
document-oriented, and graph. None of them use the relational data
model, being inherently schema-free, without obsessive complexity,
with a flexible data model and eventual consistency (complying with
BASE rather than ACID).

NoSQL data stores provide APIs to perform various
operations. Some of them support query language operations, for
example, Cassandra and HBase. However, there is no standard. NoSQL
architectures are designed to run in clusters. This makes it
possible to scale them horizontally by increasing the number of
nodes in the deployment. In addition, NoSQL data stores serve huge
amounts of data and provide a very high throughput.

How do you evaluate NoSQL data stores?

NoSQL databases differ from RDBMS in their data models.
These systems can be divided into four distinct groups:

  1. Key-value stores are similar to maps or
    dictionaries where data is addressed by a unique key.

  2. Document-oriented stores encapsulate key value
    pairs in JSON or JSON like documents. Within documents, keys have
    to be unique. In contrast to key-value stores, values are not
    opaque to the system and can be queried, as well.

  3. Column family stores are also known as column
    oriented stores, extensible record stores, and wide columnar
    stores.

  4. Graph databases—in contrast to key oriented
    NoSQL databases, graph databases are specialized in efficient
    management of heavily linked data.

NoSQL databases differ in the way they distribute data
among multiple machines. Since data models of key-value stores,
document stores, and column family stores are key oriented, the two
common partition strategies are based on keys, too:

1. The first strategy distributes datasets by the range of
their keys. A routing server splits the whole keyset into blocks
and allocates these blocks to different nodes. Afterwards, one node
is responsible for storage and request handling of his specific key
ranges. In order to find a certain key, clients have to contact the
routing server to get the partition table.

 
2. Higher availability and much simpler cluster architecture
can be achieved with the second type of distribution.

Replication is another “core” feature of NoSQL
solutions. In addition to better read performance through load
balancing, replication brings better availability and durability,
because failing nodes can be easily replaced.

If all replicas of a master server were updated
synchronously, the system would not be available until all slaves
had committed a write operation. That is why this solution is not
suitable for platforms relying on high availability, because even a
few milliseconds of latency can have a big influence on user
behavior.

And obviously, performance is also a very important
factor. Performance of data storage solutions can be evaluated
using typical scenarios. These scenarios simulate the most common
operations performed by applications that use the data store, also
known as typical workloads. The tests that were performed by
Altoros to compare performance of several NoSQL data stores also
used typical workloads.

Performance evaluation approach

Database vendors usually measure productivity of their
products with custom hardware and software settings designed to
demonstrate the advantages of their solutions. In our tests we
tried to see how NoSQL data stores perform under the equal
conditions.

For benchmarking, we used the Yahoo Cloud Serving
Benchmark (YCSB). The kernel of YCSB has a framework with a
workload generator that creates test workload and a set of workload
scenarios. When using YCSB, developers have to describe the
scenario of the workload by operation type, i.e. indicate what
operations are performed on what types of records. Supported
operations include: insert, update (change one of the fields), read
(one random field or all the field of one record), and scan (read
the records in the order of the key starting from a randomly
selected record).

In our tests each workload was applied to a table of
100,000,000 records; each record was 1,000 bytes in size and
contained 10 fields. A primary key identified each record, which
was a string, such as, for example, “user234123”. Each field was
named field0, field1, and so on. The values in each field were
random strings of ASCII characters, 100 bytes each.

Database performance was defined by the speed at which
it computed basic operations. An action performed by the workload
executor, which drives multiple client threads, was considered to
be a basic operation. In a NoSQL data store, each thread executes a
sequential series of operations by making calls to the database
interface layer both to load the database (the load phase) and to
execute the workload (the transaction phase). The threads throttle
the rate at which they generate requests, so that we may directly
control the load. In addition, the threads measure the latency and
throughput achieved when performing operations. This data is then
sent to the statistics module.

Testing environment

For our test, we decided to use the AWS public cloud
environment. All the virtual machines operated in one region
(Ireland, Europe) and in one availability zone (or one datacenter).
Each database stored data in a four-node cluster. We used m1.xlarge
computing instances. The nodes were 64-bit instances with 16 GB of
RAM, 4 vCPU, 8 ECU, and high-performance network. We used Amazon
Linux as the operating system.

For each node in the cluster, we created data storage.
We used four EBS optimized elastic block storage volumes (EBS) of
25 GB each. The volumes were assembled in RAID0. Data were
distributed or sharded into 4 nodes.

The client with the YCSB framework was on a separate
c1.xlarge instance. For MongoDB, we used two additional c1.medium
instances that served as routers. This was necessary due to the
specifics of MongoDB’s architecture.

We started all the instances with the same security
group and preconfigured all the necessary network ports for the
nodes to communicate. We also configured all the ports required for
each database to be opened. Then we uploaded workload definitions
to the YCSB client and performed the tests. The results were
measured at the instance where YCSB was installed.

Figure 1: The infrastructure for testing NoSQL data
stores. Source: Altoros

Databases to evaluate and workload definitions

In our test we measured the performance of several
NoSQL databases that we consider to be the most mature and popular
products on the market. Let’s take a closer look at each of
them:

  • Cassandra 2.0. This is a column-value data
    store. We ran it with Java 1.7.40 installed. The transactions were
    performed with the non-default configuration. In particular, we
    used a random partitioner to section data by nodes. The amount of
    data cash for the keys was 1 GB. The size of row cash was 6 GB. The
    size of JVM heap was 6 GB.

  • MongoDB 2.4.6. This is a document-oriented
    database. Here, we did not do much additional configuration or
    tuning—we added two instances that served as routers, as
    recommended by MongoDB documentation. However, if you need to
    simplify the model, mongo router may run on the same machine where
    the YCSB client is. In one of our earlier tests, we discovered that
    it uses a lot of CPU. This is why we placed router processes on 2
    separate machines. Data sharding for MongoDB was based on document
    key.

  • HBase 0.92. For HBase, we set the size of memory
    for the JVM to 12 GB.

Additionally, we used data compression with the Snappy
compressor for Cassandra and HBase.

The replica factor was set to one for all data stores.
This approach was intentional – we wanted to test performance, not
the failure tolerance of the cluster.

We used the following workloads with the YSCB
framework:

  • Workload A. Workload A is an update-heavily
    scenario that simulates how a database works, when recording
    typical actions of an e-commerce solution user.

  • Workload B. Workload B is a read-mostly workload
    that has a 95/5 (ninety five to five percent) read/update ratio. It
    recaps content tagging when adding a tag is an update, but most
    operations include reading tags.

  • Workload C. Workload C is a read-only workload
    that simulates a data caching layer, for example a user profile
    cache.

  • Workload D. Workload D has a 95/5 read/insert
    ratio. The workload simulates accessing the latest data, such as
    user status updates, or working with inbox messages.

  • Workload E. Workload E is a scan-short-ranges
    workload with a scan/insert percentile proportion of 95/5. It
    simulates threaded conversations that are clustered by a thread ID.
    Each scan is performed for the posts of a given thread.

  • Workload F. Workload F has
    read-modify-write/read ops in a proportion of 50/50. It simulates
    accessing a user database where records are read and modified by
    the user. This activity is also recorded to this
    database.

  • Workload G. Workload G has a 10/90 read/insert
    ratio. It simulates a process of data migration or a process of
    creating large amounts of data.

Every workload was executed by 100 concurrent threads.
The dataset consisted of 100,000,000 records and the number of
operations that were divided between threads was 10,000.

Load phase

During the first stage of the test, the load phase, we
uploaded 100,000,000 records of 1 Kb each to every data store. YCSB
measured the average throughput in operations per second and
average latency of operations in milliseconds. The next diagram
displays the results of the load phase:

Figure 2: The results of the load phase. Source:
Altoros

HBase demonstrated the lowest throughput, probably
because we turned on the auto-flash mode. This mode ensures that
each operation that creates a record will be sent from the client
to the server and then persisted to the database. HBase also
supports an alternative mode that uses additional cash on the
client side. When the client is out of client cache it sends data
from the cash to the server. In this alternative mode, HBase saves
data to disk in batches.

As we expected, Cassandra demonstrated excellent
results with almost 18,000 operations per second. This is due to
Cassandra’s architecture. It simultaneously updates data in memory
and writes it to the transaction journal on the disk. This
guarantees data persistency should a node crash.

The number of operations per second in MongoDB’s
results was pretty close to that of Hbase – the average latency was
around seven milliseconds at 13,000 operations per second.

In this particular test, all data was loaded in a
single iteration, but insert, update, read, and scan operations in
the transaction phase of the test were performed in five iterations
for each workload (for every database).

It should be noted that many of the diagrams with test
results demonstrate that database performance is limited and starts
to decline at a certain throughput level. Also we need to mention
that we used Amazon AWS and network storage which could potentially
influence the results.

Workload A

Workload A includes read and update operations in a
ratio of 50/50. It simulates an e-commerce application. This slide
shows the results of update operations.

Figure 3: The results of update operations in Workload
A. Source: Altoros

Cassandra and HBase demonstrated good performance with
a throughput below 20 milliseconds.

MongoDB’s latency increased substantially as we
increased the workload. At 100 operations per second all the four
databases had similar performance. But when the workload reached
1,500 operations per second, MongoDB’s latency increased to 100
milliseconds.

The next diagram shows results of read operations in
Workload A.

Figure 4: Figure 4. The results of read operations for
Workload A. Source: Altoros

All the graphs are very different because reads and
updates were randomly distributed. The results for read operations
in Workload A were more or less similar in all the tested
solutions. The difference in latencies was insignificant, within a
range of 15-30 milliseconds.

Workload B

Workload B is read-mostly with 95% of reads and only
5% of updates. It simulates content tagging when adding a tag is an
update, but most other transactions are reads. Here are the results
for update operations in workload B.

Figure 5. The results of update operations for
Workload B. Source: Altoros

Cassandra demonstrates a very low latency, but her
performance is limited to 1200 operations per second. With HBase,
the latency increases evenly as the workload grows. The behavior of
MongoDB is similar to the previous test where the latency increased
together with the throughput.

The next slide shows the results of read operations
that make up 95% of Workload B.

Figure 6. The results of read operations for Workload
B. Source: Altoros

HBase demonstrated the highest throughput with several
peaks in latency. MongoDB also had good results thanks to its
architecture, which features memory-mapped files. On this diagram,
there are points where the maximum throughput starts to degrade.
They were left here on purpose.

Workload C

Workload C is a read-only workload. MongoDB’s
performance does not degrade as the workload increases. Latency
peaks on the graphs for Cassandra and HBase can be explained by the
fact that we used cloud infrastructure to run the tests and network
storage for the data nodes.

Figure 7. The results of read operations for Workload
C. Source: Altoros

Workload D

Workload D includes a 5% of insert and 95% of read
operations. It simulates users checking inbox messages or accessing
latest data, for example, status updates. The following slide shows
the results of insert operations.

Figure 8. The results of insert operations for
Workload D. Source: Altoros

Cassandra demonstrated the best performance. The
latencies were within five milliseconds. This is similar to what we
saw at the upload stage where Cassandra was pretty efficient.

The maximum throughput for HBase is 1500 operations
per second.

Mongo has an acceptable throughput of up to 2500 ops
per second. At the same time, the average latency doubles.

Read operations are the second scenario in Workload D.
Cassandra and MongoDB achieved good results. For HBase, the
throughput was limited to 1500 operations per second. The maximum
throughput in this mixed workload was demonstrated by MongoDB.

Figure 9. The results of read operations for Workload
D. Source: Altoros

Workload F

Workload F consisted of randomly distributed read and
complex read/modify/write operations. Each record was read,
changed, and then saved to the database.

The first diagram shows the results of read operations
only. All the databases demonstrated similar results in this test.
Cassandra and HBase had maximum throughput values of about 1,500
operations per second (for read operations). Performance of MongoDB
starts to decline at less than 1,000 operations per second.

Figure 10. The results of read operations for Workload
F. Source: Altoros

The second diagram for Workload F shows the results of
update operations performed on data that has already been read.

Hbase and Cassandra have very low latencies. As you
can see on all the previous graphs, update operations are performed
pretty fast by all the databases, except for MongoDB. As we
increase the workload, MongoDB’s latency starts to grow
substantially.

Figure 11. The results of read operations for Workload
F. Source: Altoros

The third diagram for Workload F shows the results of
read-modify-write operations. Cassandra and HBase have similar
performance. MongoDB’s latency increases together with the
workload, up to the maximum throughput of roughly 1,000 operations
per second.

We can conclude that Cassandra and Hbase are pretty
good at dealing with mixed workloads.

Figure 12. The results of read-modiry-write operations
for Workload F. Source: Altoros

Workload G

The last workload, Workload G, mostly consists of
insert operations. It simulates the process of data migration or
inserting a lot of data into a database. The results are similar to
what you can see on the previous graphs. HBase and Cassandra
demonstrate low latencies and a high throughput. MongoDB’s
performance starts to decline at about 4,000 ops per second with an
average latency up to five times greater than in other
databases.

Figure 13. The results of insert operations for
Workload G. Source: Altoros

Below is the last diagram, showing the results of the
read operations which make up 10% of workload G. Here we can see
that latencies vary for different solutions. This might be because
the data is in network storage on the cloud. Cassandra shows a
maximum throughput of up to 7,000 operations per second and
MongoDB’s throughput is limited by 4,000 operations per second.

Figure 14. The results of read operations for Workload
G. Source: Altoros

Conclusion

What you choose depends on your needs. Before making
the decision you should:

  • Determine what your data sets and your data
    model will be like; the data model will depend on the data sets and
    typical operations that your app will perform

  •  Determine your requirements to transaction
    support; decide whether you need transactions

  •  Choose whether you need replication;
    decide on your requirements to data consistency

  •  Determine your performance
    requirements

  •  If the project is based on an existing
    solution, evaluate if it is possible to migrate existing
    data

Then, taking into account all these factors, evaluate
different solutions and test their performance. It is very useful
to build a prototype of your future system during the
proof-of-concept phase. Prototyping makes it possible to see how
the solution will work in a real-life project. If it does not work
well enough, you need to review the architecture, the components,
and build a new prototype.

There are no perfect solutions and there are no bad
NoSQL or RDBMS data stores. Which database is the best for your use
case can be determined by particular system requirements. The tests
performed by Altoros show that in different scenarios different
solutions have very different results. Your final choice might be a
compromise. The main determinant will be what you want to achieve
and what properties you need most. The system may use several
solutions, including relational and NoSQL databases.

Sergey Sverchkov: Head of Software
Development, Altoros

With more than 15 years in IT industry and strong
experience in the whole cycle of software project development and
support, Sergey serves as Senior R&D Engineer and Project
Manager at Altoros. He specializes in Big Data analysis,
integration and processing. Sergey is an Oracle DBA trainer, and a
frequent speaker on data analysis and cloud technologies.

 

Author
Comments
comments powered by Disqus