Evaluating NoSQL performance: Which database is right for your data?
Often referred to as NoSQL, non-relational databases feature elasticity and scalability. In addition, they can store big data and work with cloud computing systems. All of these factors make them extremely popular. In 2013, the number of NoSQL products reached 150 plus, and the figure is still growing. That variety makes it difficult to select just one. To make things worse, the abundance of marketing materials describing NoSQL products makes it hard to understand whether a particular solution will be useful for your use case.
To choose the best NoSQL solutions for our customers’ projects, we at Altoros test them under varying types of workloads. This article gives an overview of the results of the latest performance tests of some of the most mature and popular NoSQL data stores on the market.
NoSQL data stores: The basics
Why did NoSQL data stores appear? Mostly because relational databases (RDBMS) have a number of restrictions when working with large datasets. For example, RDBMS are hard to scale because their architecture was designed to work on a single machine:
-Scaling write operations is hard, expensive, or impossible.
-Vertical scaling (or upgrading equipment) is either limited or very expensive. Unfortunately, this is often the only possible way you can scale.
-Horizontal scaling (or adding new nodes to the cluster) is either unavailable or you can only implement it partially. There are some solutions from Oracle and Microsoft that make it possible to have computing instances on several servers. Still, the database itself remains in shared storage.
In addition to poor scalability, RDBMS have strict data models. The schema is created together with the database and you may need significant time and effort to change this structure. In most cases it is a complex process that most likely involves a considerable amount of production downtime. Apart from that, RDBMS have difficulties with semi-structured data.
NoSQL solutions address these and a lot of other problems. Several types exist: key-value, columnar, document-oriented, and graph. None of them uses the relational data model. They are inherently schema-free, avoid excessive complexity, and offer a flexible data model with eventual consistency (complying with BASE rather than ACID).
NoSQL data stores provide APIs to perform various operations. Some of them support query language operations, for example, Cassandra and HBase. However, there is no standard. NoSQL architectures are designed to run in clusters. This makes it possible to scale them horizontally by increasing the number of nodes in the deployment. In addition, NoSQL data stores serve huge amounts of data and provide a very high throughput.
How do you evaluate NoSQL data stores?
NoSQL databases differ from RDBMS in their data models. These systems can be divided into four distinct groups:
Key-value stores are similar to maps or dictionaries where data is addressed by a unique key.
Document-oriented stores encapsulate key-value pairs in JSON or JSON-like documents. Within documents, keys have to be unique. In contrast to key-value stores, values are not opaque to the system and can be queried as well.
Column family stores are also known as column oriented stores, extensible record stores, and wide columnar stores.
Graph databases—in contrast to key oriented NoSQL databases, graph databases are specialized in efficient management of heavily linked data.
NoSQL databases differ in the way they distribute data among multiple machines. Since data models of key-value stores, document stores, and column family stores are key oriented, the two common partition strategies are based on keys, too:
1. The first strategy distributes datasets by the range of their keys. A routing server splits the whole keyset into blocks and allocates these blocks to different nodes. Afterwards, each node is responsible for storage and request handling of its specific key ranges. In order to find a certain key, clients have to contact the routing server to get the partition table.
2. Higher availability and much simpler cluster architecture can be achieved with the second type of distribution.
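The range-based strategy from the first item can be sketched in a few lines. This is a minimal illustration, not the routing code of any particular data store: the `RangeRouter` class, the split keys, and the node names are all hypothetical.

```python
import bisect


class RangeRouter:
    """Toy partition table that maps key ranges to nodes (hypothetical names)."""

    def __init__(self, split_keys, nodes):
        # N sorted split keys divide the keyset into N + 1 blocks, one per node.
        if len(nodes) != len(split_keys) + 1:
            raise ValueError("need exactly one node per key block")
        self.split_keys = sorted(split_keys)
        self.nodes = list(nodes)

    def node_for(self, key):
        # bisect_right returns the index of the block that contains the key;
        # a key equal to a split key belongs to the block that starts there.
        return self.nodes[bisect.bisect_right(self.split_keys, key)]


# Assumed example layout: three split keys, four nodes.
router = RangeRouter(
    ["user25", "user50", "user75"],
    ["node-a", "node-b", "node-c", "node-d"],
)
print(router.node_for("user10"))  # node-a
print(router.node_for("user60"))  # node-c
```

Keys are compared as strings here, so the ranges are lexicographic. A client that caches this table can route requests itself, but it has to return to the routing server whenever the blocks are reassigned.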
Replication is another “core” feature of NoSQL solutions. In addition to better read performance through load balancing, replication brings better availability and durability, because failing nodes can be easily replaced.
If all replicas of a master server were updated synchronously, the system would not be available until all slaves had committed a write operation. That is why this solution is not suitable for platforms relying on high availability, because even a few milliseconds of latency can have a big influence on user behavior.
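The cost of synchronous replication can be shown with a small model. This is a sketch under simplifying assumptions: replicas are written in parallel, and the millisecond values are made up for illustration, not measured.

```python
def sync_write_latency_ms(master_ms, replica_ms):
    """Acknowledge only after every replica has committed.

    Replicas are assumed to be written in parallel, so the slowest
    replica determines how long the client waits.
    """
    return master_ms + max(replica_ms, default=0)


def async_write_latency_ms(master_ms, replica_ms):
    """Acknowledge as soon as the master has committed; replicas catch up later."""
    return master_ms


# Hypothetical latencies: one slow replica dominates the synchronous write.
replicas = [3, 40, 5]
print(sync_write_latency_ms(2, replicas))   # 42
print(async_write_latency_ms(2, replicas))  # 2
```

The asynchronous variant answers faster but leaves a window in which replicas return stale data, which is the trade-off behind eventual consistency.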
And obviously, performance is also a very important factor. Performance of data storage solutions can be evaluated using typical scenarios. These scenarios simulate the most common operations performed by applications that use the data store, also known as typical workloads. The tests that were performed by Altoros to compare performance of several NoSQL data stores also used typical workloads.
Performance evaluation approach
Database vendors usually measure the performance of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. In our tests we tried to see how NoSQL data stores perform under equal conditions.
For benchmarking, we used the Yahoo! Cloud Serving Benchmark (YCSB). The kernel of YCSB has a framework with a workload generator that creates the test workload and a set of workload scenarios. When using YCSB, developers have to describe the scenario of the workload by operation type, i.e. indicate what operations are performed on what types of records. Supported operations include: insert, update (change one of the fields), read (one random field or all the fields of one record), and scan (read the records in the order of the key starting from a randomly selected record).
In our tests each workload was applied to a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. Each record was identified by a primary key, which was a string such as “user234123”. Each field was named field0, field1, and so on. The values in each field were random strings of ASCII characters, 100 bytes each.
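A record of this shape can be generated as follows. This is an illustrative sketch rather than YCSB's own generator: the helper names and the choice of alphabet are assumptions.

```python
import random
import string

FIELD_COUNT = 10    # fields per record, as described above
FIELD_LENGTH = 100  # bytes per field value


def make_key(n):
    # Keys follow the "user<number>" pattern from the text.
    return f"user{n}"


def make_record(rng):
    # Assumed alphabet: ASCII letters and digits, one byte per character.
    alphabet = string.ascii_letters + string.digits
    return {
        f"field{i}": "".join(rng.choices(alphabet, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)
    }


rng = random.Random(42)
key, record = make_key(234123), make_record(rng)
payload_bytes = sum(len(value.encode("ascii")) for value in record.values())
print(key, payload_bytes)  # user234123 1000
```

Ten fields of 100 bytes give the 1,000-byte payload. The key and field names add a little on top of that in whatever format a data store uses on disk.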
Database performance was defined by the speed at which it computed basic operations. An action performed by the workload executor, which drives multiple client threads, was considered to be a basic operation. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we can directly control the load. In addition, the threads measure the latency and throughput achieved when performing operations. This data is then sent to the statistics module.
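The throttling and measurement loop described above can be sketched as follows. This is a simplified stand-in for the YCSB workload executor, not its actual code: `run_client`, `run_workload`, and `fake_db_op` are hypothetical names, and the "database" is just a one-millisecond sleep.

```python
import threading
import time


def run_client(db_op, op_count, target_ops_per_sec, latencies_ms):
    """One client thread: issue operations at a fixed target rate."""
    interval = 1.0 / target_ops_per_sec
    next_start = time.perf_counter()
    for _ in range(op_count):
        # Throttle: wait until the scheduled start time of this operation.
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        started = time.perf_counter()
        db_op()
        latencies_ms.append((time.perf_counter() - started) * 1000.0)
        next_start += interval


def run_workload(db_op, threads, ops_per_thread, target_ops_per_sec):
    """Drive several client threads and aggregate their measurements."""
    per_thread_rate = target_ops_per_sec / threads
    results = [[] for _ in range(threads)]
    workers = [
        threading.Thread(
            target=run_client,
            args=(db_op, ops_per_thread, per_thread_rate, results[i]),
        )
        for i in range(threads)
    ]
    started = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    elapsed = time.perf_counter() - started
    latencies = [value for result in results for value in result]
    return {
        "operations": len(latencies),
        "throughput_ops_per_sec": len(latencies) / elapsed,
        "avg_latency_ms": sum(latencies) / len(latencies),
    }


def fake_db_op():
    # Stand-in for a call to the database interface layer.
    time.sleep(0.001)


stats = run_workload(fake_db_op, threads=4, ops_per_thread=25, target_ops_per_sec=400)
print(stats)
```

When the data store cannot keep up with the target rate, the loop stops sleeping and the achieved throughput falls below the target. That is the point at which a latency curve flattens out or bends upward.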
For our test, we decided to use the AWS public cloud environment. All the virtual machines operated in one region (Ireland, Europe) and in one availability zone (or one datacenter). Each database stored data in a four-node cluster. We used m1.xlarge computing instances. The nodes were 64-bit instances with 16 GB of RAM, 4 vCPU, 8 ECU, and high-performance network. We used Amazon Linux as the operating system.
For each node in the cluster, we created data storage. We used four EBS-optimized Elastic Block Store (EBS) volumes of 25 GB each. The volumes were assembled in RAID0. Data was distributed, or sharded, across four nodes.
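As a quick sanity check, the raw dataset size can be compared with the storage on each node. The sketch below uses only figures given in this article (100,000,000 records of 1,000 bytes, four nodes, four 25 GB volumes per node, a replication factor of one). It treats 1 GB as 10^9 bytes and ignores compression, indexes, and storage-engine overhead, so the result is a rough estimate only.

```python
RECORDS = 100_000_000
RECORD_BYTES = 1_000
NODES = 4
VOLUMES_PER_NODE = 4
VOLUME_GB = 25
GB = 10**9  # simplifying assumption: decimal gigabytes

# Raw payload of the whole dataset, without keys or overhead.
raw_dataset_gb = RECORDS * RECORD_BYTES / GB

# With even sharding and a replication factor of one, each node holds a quarter.
raw_per_node_gb = raw_dataset_gb / NODES

# RAID0 stripes data across volumes, so usable capacity is the sum of the volumes.
raid0_capacity_per_node_gb = VOLUMES_PER_NODE * VOLUME_GB

print(raw_dataset_gb, raw_per_node_gb, raid0_capacity_per_node_gb)  # 100.0 25.0 100
```

By this estimate each node stores about 25 GB of raw payload on a 100 GB striped volume, which leaves headroom for on-disk overhead.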
The client with the YCSB framework was on a separate c1.xlarge instance. For MongoDB, we used two additional c1.medium instances that served as routers. This was necessary due to the specifics of MongoDB's architecture.
We started all the instances with the same security group and preconfigured all the necessary network ports for the nodes to communicate. We also configured all the ports required for each database to be opened. Then we uploaded workload definitions to the YCSB client and performed the tests. The results were measured at the instance where YCSB was installed.
Figure 1: The infrastructure for testing NoSQL data stores. Source: Altoros
Databases to evaluate and workload definitions
In our test we measured the performance of several NoSQL databases that we consider to be the most mature and popular products on the market. Let’s take a closer look at each of them:
Cassandra 2.0. This is a column-family data store. We ran it with Java 1.7.40 installed. The transactions were performed with a non-default configuration. In particular, we used the random partitioner to distribute data across nodes. The key cache size was 1 GB, the row cache size was 6 GB, and the JVM heap size was 6 GB.
MongoDB 2.4.6. This is a document-oriented database. Here, we did not do much additional configuration or tuning. We added two instances that served as routers, as recommended by the MongoDB documentation. If you need to simplify the model, the mongos router can run on the same machine as the YCSB client. However, in one of our earlier tests we discovered that it uses a lot of CPU, which is why we placed the router processes on two separate machines. Data sharding for MongoDB was based on the document key.
HBase 0.92. For HBase, we set the size of memory for the JVM to 12 GB.
Additionally, we used data compression with the Snappy compressor for Cassandra and HBase.
The replication factor was set to one for all data stores. This approach was intentional: we wanted to test performance, not the fault tolerance of the cluster.
We used the following workloads with the YCSB framework:
Workload A. Workload A is an update-heavy scenario that simulates how a database works when recording typical actions of an e-commerce solution user.
Workload B. Workload B is a read-mostly workload that has a 95/5 (ninety-five to five percent) read/update ratio. It simulates content tagging, where adding a tag is an update, but most operations are tag reads.
Workload C. Workload C is a read-only workload that simulates a data caching layer, for example a user profile cache.
Workload D. Workload D has a 95/5 read/insert ratio. The workload simulates accessing the latest data, such as user status updates, or working with inbox messages.
Workload E. Workload E is a short-range scan workload with a scan/insert percentage ratio of 95/5. It simulates threaded conversations that are clustered by a thread ID. Each scan is performed for the posts of a given thread.
Workload F. Workload F has read-modify-write/read ops in a proportion of 50/50. It simulates accessing a user database where records are read and modified by the user. This activity is also recorded to this database.
Workload G. Workload G has a 10/90 read/insert ratio. It simulates a process of data migration or a process of creating large amounts of data.
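The operation mixes above can be written down as a table of proportions. This is an illustrative sketch rather than YCSB's workload file format: the dictionary layout and the `choose_operation` helper are assumptions, and the 50/50 read/update split for Workload A is the one stated later in this article.

```python
import random

# Share of each operation type per workload, taken from the descriptions above.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
    "F": {"read": 0.50, "read-modify-write": 0.50},
    "G": {"read": 0.10, "insert": 0.90},
}


def choose_operation(workload, rng):
    """Pick the next operation type according to the workload's proportions."""
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]


rng = random.Random(7)
sample = [choose_operation("B", rng) for _ in range(10_000)]
print(sample.count("read") / len(sample))  # close to 0.95
```

Each client thread would call a function like `choose_operation` before every request, so the measured results for one workload always contain a blend of operation types.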
Every workload was executed by 100 concurrent threads. The dataset consisted of 100,000,000 records and the number of operations that were divided between threads was 10,000.
During the first stage of the test, the load phase, we uploaded 100,000,000 records of 1 KB each to every data store. YCSB measured the average throughput in operations per second and average latency of operations in milliseconds. The next diagram displays the results of the load phase:
Figure 2: The results of the load phase. Source: Altoros
HBase demonstrated the lowest throughput, probably because we turned on the autoflush mode. This mode ensures that each operation that creates a record is sent from the client to the server and then persisted to the database. HBase also supports an alternative mode that uses an additional cache on the client side. When the client-side cache is full, the client sends the cached data to the server. In this alternative mode, HBase saves data to disk in batches.
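The difference between the two client modes described above comes down to how many round trips the client makes. The sketch below is a generic illustration and does not use the HBase client API: `BufferedWriter`, its parameters, and the 1,024-byte buffer limit are hypothetical.

```python
class BufferedWriter:
    """Toy client that either sends every write at once or batches writes."""

    def __init__(self, send_batch, autoflush=True, buffer_limit_bytes=1024):
        self.send_batch = send_batch  # callable standing in for a server round trip
        self.autoflush = autoflush
        self.buffer_limit_bytes = buffer_limit_bytes
        self._buffer = []
        self._buffered_bytes = 0

    def put(self, key, value):
        self._buffer.append((key, value))
        self._buffered_bytes += len(key) + len(value)
        # Send at once in flush-per-operation mode, otherwise when the buffer is full.
        if self.autoflush or self._buffered_bytes >= self.buffer_limit_bytes:
            self.flush()

    def flush(self):
        if self._buffer:
            self.send_batch(list(self._buffer))
            self._buffer.clear()
            self._buffered_bytes = 0


def count_round_trips(autoflush):
    calls = []
    writer = BufferedWriter(calls.append, autoflush=autoflush)
    for i in range(10):
        writer.put(f"user{i}", "x" * 100)
    writer.flush()  # send whatever is still buffered
    return len(calls)


print(count_round_trips(autoflush=True))   # 10
print(count_round_trips(autoflush=False))  # 1
```

Batching trades durability for throughput: writes still sitting in the client buffer are lost if the client crashes before the next flush.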
As we expected, Cassandra demonstrated excellent results with almost 18,000 operations per second. This is due to Cassandra’s architecture. It simultaneously updates data in memory and writes it to the transaction journal on the disk. This guarantees data persistency should a node crash.
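The write path described above, an in-memory update combined with an append to an on-disk journal, can be sketched as follows. This is a toy model and not Cassandra's implementation: `TinyStore` and the JSON log format are hypothetical, and syncing the log on every write is a simplification chosen to keep the example short.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy store: appends each write to a log file and keeps data in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # rebuild in-memory state after a restart
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    entry = json.loads(line)
                    self.memtable[entry["key"]] = entry["value"]

    def put(self, key, value):
        # Append to the journal first, then update memory.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # simplification: sync on every write
        self.memtable[key] = value

    def close(self):
        self._log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commit.log")
    store = TinyStore(path)
    store.put("user1", "hello")
    store.close()                # simulate a node going down
    recovered = TinyStore(path)  # a restart replays the journal
    recovered_value = recovered.memtable["user1"]
    recovered.close()

print(recovered_value)  # hello
```

Appending to a log is sequential I/O, which is one reason this kind of design handles insert-heavy loads well.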
MongoDB’s throughput was pretty close to that of HBase: the average latency was around seven milliseconds at 13,000 operations per second.
In this particular test, all data was loaded in a single iteration, but insert, update, read, and scan operations in the transaction phase of the test were performed in five iterations for each workload (for every database).
It should be noted that many of the diagrams with test results demonstrate that database performance is limited and starts to decline at a certain throughput level. Also we need to mention that we used Amazon AWS and network storage which could potentially influence the results.
Workload A includes read and update operations in a ratio of 50/50. It simulates an e-commerce application. The following diagram shows the results of update operations.
Figure 3: The results of update operations in Workload A. Source: Altoros
Cassandra and HBase demonstrated good performance with a latency below 20 milliseconds.
MongoDB’s latency increased substantially as we increased the workload. At 100 operations per second all the databases had similar performance. But when the workload reached 1,500 operations per second, MongoDB’s latency increased to 100 milliseconds.
The next diagram shows results of read operations in Workload A.
Figure 4. The results of read operations for Workload A. Source: Altoros
All the graphs are very different because reads and updates were randomly distributed. The results for read operations in Workload A were more or less similar in all the tested solutions. The difference in latencies was insignificant, within a range of 15-30 milliseconds.
Workload B is read-mostly with 95% of reads and only 5% of updates. It simulates content tagging when adding a tag is an update, but most other transactions are reads. Here are the results for update operations in workload B.
Figure 5. The results of update operations for Workload B. Source: Altoros
Cassandra demonstrates a very low latency, but its performance is limited to 1,200 operations per second. With HBase, the latency increases evenly as the workload grows. The behavior of MongoDB is similar to the previous test, where the latency increased together with the throughput.
The next diagram shows the results of read operations that make up 95% of Workload B.
Figure 6. The results of read operations for Workload B. Source: Altoros
HBase demonstrated the highest throughput with several peaks in latency. MongoDB also had good results thanks to its architecture, which features memory-mapped files. On this diagram, there are points where the maximum throughput starts to degrade. They were left here on purpose.
Workload C is a read-only workload. MongoDB’s performance does not degrade as the workload increases. Latency peaks on the graphs for Cassandra and HBase can be explained by the fact that we used cloud infrastructure to run the tests and network storage for the data nodes.
Figure 7. The results of read operations for Workload C. Source: Altoros
Workload D consists of 5% insert and 95% read operations. It simulates users checking inbox messages or accessing the latest data, for example, status updates. The following diagram shows the results of insert operations.
Figure 8. The results of insert operations for Workload D. Source: Altoros
Cassandra demonstrated the best performance. The latencies were within five milliseconds. This is similar to what we saw at the upload stage where Cassandra was pretty efficient.
The maximum throughput for HBase is 1,500 operations per second.
MongoDB has an acceptable throughput of up to 2,500 operations per second. At the same time, the average latency doubles.
Read operations are the second scenario in Workload D. Cassandra and MongoDB achieved good results. For HBase, the throughput was limited to 1500 operations per second. The maximum throughput in this mixed workload was demonstrated by MongoDB.
Figure 9. The results of read operations for Workload D. Source: Altoros
Workload F consisted of randomly distributed read and complex read/modify/write operations. Each record was read, changed, and then saved to the database.
The first diagram shows the results of read operations only. All the databases demonstrated similar results in this test. Cassandra and HBase had maximum throughput values of about 1,500 operations per second (for read operations). Performance of MongoDB starts to decline at less than 1,000 operations per second.
Figure 10. The results of read operations for Workload F. Source: Altoros
The second diagram for Workload F shows the results of update operations performed on data that has already been read.
HBase and Cassandra have very low latencies. As you can see on all the previous graphs, update operations are performed pretty fast by all the databases, except for MongoDB. As we increase the workload, MongoDB’s latency starts to grow substantially.
Figure 11. The results of update operations for Workload F. Source: Altoros
The third diagram for Workload F shows the results of read-modify-write operations. Cassandra and HBase have similar performance. MongoDB’s latency increases together with the workload, up to the maximum throughput of roughly 1,000 operations per second.
We can conclude that Cassandra and HBase are pretty good at dealing with mixed workloads.
Figure 12. The results of read-modify-write operations for Workload F. Source: Altoros
The last workload, Workload G, mostly consists of insert operations. It simulates the process of data migration or inserting a lot of data into a database. The results are similar to what you can see on the previous graphs. HBase and Cassandra demonstrate low latencies and a high throughput. MongoDB’s performance starts to decline at about 4,000 ops per second with an average latency up to five times greater than in other databases.
Figure 13. The results of insert operations for Workload G. Source: Altoros
Below is the last diagram, showing the results of the read operations, which make up 10% of Workload G. Here we can see that latencies vary for different solutions. This might be because the data is in network storage on the cloud. Cassandra shows a maximum throughput of up to 7,000 operations per second and MongoDB’s throughput is limited to 4,000 operations per second.
Figure 14. The results of read operations for Workload G. Source: Altoros
What you choose depends on your needs. Before making the decision you should:
Determine what your data sets and your data model will be like; the data model will depend on the data sets and typical operations that your app will perform
Determine your requirements for transaction support; decide whether you need transactions
Choose whether you need replication; decide on your requirements for data consistency
Determine your performance requirements
If the project is based on an existing solution, evaluate if it is possible to migrate existing data
Then, taking into account all these factors, evaluate different solutions and test their performance. It is very useful to build a prototype of your future system during the proof-of-concept phase. Prototyping makes it possible to see how the solution will work in a real-life project. If it does not work well enough, you need to review the architecture, the components, and build a new prototype.
There are no perfect solutions and there are no bad NoSQL or RDBMS data stores. Which database is the best for your use case can be determined by particular system requirements. The tests performed by Altoros show that in different scenarios different solutions have very different results. Your final choice might be a compromise. The main determinant will be what you want to achieve and what properties you need most. The system may use several solutions, including relational and NoSQL databases.
Sergey Sverchkov: Head of Software Development, Altoros
With more than 15 years in the IT industry and strong experience in the whole cycle of software project development and support, Sergey serves as Senior R&D Engineer and Project Manager at Altoros. He specializes in Big Data analysis, integration, and processing. Sergey is an Oracle DBA trainer and a frequent speaker on data analysis and cloud technologies.