Why You Should Migrate from SQL to NoSQL for Time Series Data
Do you use SQL or NoSQL databases for time series data? There are a number of significant differences, advantages, and disadvantages between the two.
The majority of people working with time series data today do so using NoSQL databases, but a significant number still use relational database management systems (RDBMS). In this article, I'll explain the differences between SQL and NoSQL databases in general, as well as the specific differences for time series use cases.
An overview of SQL vs. NoSQL databases
One of the most important differences between SQL and NoSQL is the schema, or lack thereof. The relational model means you cannot create a SQL database without a predefined schema: the data is normalized into tables, primary and foreign keys are designated, and all of the relationships between columns, rows, indices, and tables are defined.
There are two disadvantages to this approach:
- You must define your schema before you can create tables or enter data.
- Changing your schema later is difficult and disruptive.
Both can cause frustration, especially point 2: any change to the structure of your database will require a careful, disruptive migration.
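To make that rigidity concrete, here is a minimal sketch using Python's built-in sqlite3 module (any RDBMS behaves similarly; the table and column names are illustrative):

```python
import sqlite3

# The schema must exist before any data can be written.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        ts      TEXT NOT NULL,   -- timestamp
        host    TEXT NOT NULL,
        cpu_pct REAL
    )
""")
conn.execute("INSERT INTO readings VALUES ('2024-01-01T00:00:00Z', 'web-1', 42.5)")

# Recording a new field later requires altering the table first;
# inserting a row with an extra column would fail until we do.
conn.execute("ALTER TABLE readings ADD COLUMN mem_pct REAL")
conn.execute(
    "INSERT INTO readings VALUES ('2024-01-01T00:01:00Z', 'web-1', 40.0, 61.2)"
)
print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 2
```

Even in this toy case, the new column had to be migrated in before the new row could land; in production, schema changes like this often mean locking tables or coordinating downtime.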
NoSQL databases don’t have this problem. The majority of NoSQL databases have dynamic schemas, so the schema doesn’t need to be defined up front. As a result, it’s possible to create documents without a predefined structure, and each document can have a different structure.
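A minimal sketch of that flexibility, using plain Python dictionaries as stand-in documents (the field names are illustrative):

```python
# Schemaless "documents": each record carries its own fields,
# with no up-front schema and no migration step.
docs = []
docs.append({"ts": "2024-01-01T00:00:00Z", "host": "web-1", "cpu_pct": 42.5})

# A later document can add or omit fields freely.
docs.append({"ts": "2024-01-01T00:01:00Z", "host": "web-1",
             "cpu_pct": 40.0, "mem_pct": 61.2, "region": "us-east"})

# The second document gained two fields the first never declared.
print(sorted(docs[1].keys() - docs[0].keys()))  # ['mem_pct', 'region']
```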
Another benefit of NoSQL databases is greater potential for scalability. SQL databases use table structures, while NoSQL databases typically use key-value stores. In practice, SQL-based databases scale vertically (scaling up requires a beefier server: more processing power, more memory, and so on), while NoSQL databases tend to scale horizontally (scaling up involves some combination of sharding and adding more servers). Horizontally scaled systems can ultimately become far more powerful than vertically scaled ones: there is an upper bound on how much RAM, CPU, or disk you can add to a single server, but additional servers can be added almost indefinitely.
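A minimal sketch of the horizontal approach, assuming a simple hash-based routing scheme (the server names are illustrative, and production systems typically use more sophisticated schemes such as consistent hashing):

```python
import hashlib

# Horizontal scaling in miniature: route each key to one of N shard
# servers by hashing. Adding capacity means adding servers to this
# list and rebalancing, not buying a bigger machine.
SHARDS = ["server-a", "server-b", "server-c"]

def shard_for(key: str) -> str:
    """Deterministically map a key to exactly one shard."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard.
assert shard_for("host-1") == shard_for("host-1")
print(all(shard_for(k) in SHARDS for k in ("host-1", "host-2", "host-3")))  # True
```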
All that being said, NoSQL data stores have some downsides as a result of their distributed nature. NoSQL databases are subject to the CAP theorem, which states that it is impossible for a distributed database to simultaneously guarantee consistency, availability, and partition tolerance; at most two of the three can be provided at once.
Finally, the most significant downside of NoSQL data stores is their relative lack of history compared with SQL ones.
Structured Query Language has been around for more than 40 years now. As a result, it’s the basis for a great many relational databases (e.g. MySQL, PostgreSQL, SQL Server). Because all of these databases, each with many thousands of users, use SQL, it’s extremely stable and well supported. By contrast, it can be difficult to find expert developers for many NoSQL databases, and the selection of third-party consultants to help with complex implementations is more limited.
Because of this, it’s critical that whatever NoSQL database you end up using has excellent community support, since that is what you will rely on in the absence of a deep pool of consultants.
Using SQL vs. NoSQL for time series use cases
Achieving high performance for time series workloads from a SQL database requires significant customization and configuration. Without that, unless you’re working with a very small dataset, a SQL-based database simply won’t keep up. As an example, an untuned PostgreSQL instance’s ingest rate dips significantly once the total dataset passes roughly 100 million rows, and becomes nearly unusable past 400 million rows.
Configuring a SQL database, or for that matter any database not optimized for time series (e.g. MongoDB, Cassandra), for time series data is not trivial. A purpose-built time series database, on the other hand, comes equipped to work with time series data right out of the metaphorical box.
Time series data is unique. It is almost always monitoring data, usually collected to determine whether a system, host, environment, or patient is healthy. To optimize for time series use cases, several assumptions and tradeoffs were made during the design of the time series database InfluxDB. For example, data is added in time-ascending order, and when deletes occur, they typically target a large range of old data. These assumptions enabled several design decisions that promote InfluxDB’s high query and write performance. InfluxDB anticipates and supports high-cardinality use cases by storing the index on disk. It is written in Go for fast, concurrent data fetching and an easy binary deploy. Finally, it is worth noting that InfluxDB is an exception in the NoSQL world: it is a hybrid, non-strict CA/AP database, balancing consistency, availability, and partition tolerance.
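The two access patterns described above, time-ascending writes and range deletes of old data, can be sketched in Python. This is an illustrative simplification of the idea, not InfluxDB's actual storage engine:

```python
import bisect

# Writes arrive in time-ascending order (append to the end),
# and deletes target a contiguous range of old data (drop a prefix).
timestamps, values = [], []

def append(ts: float, value: float) -> None:
    # Time-ascending ingest means an append, never an insert-in-the-middle.
    assert not timestamps or ts >= timestamps[-1]
    timestamps.append(ts)
    values.append(value)

def delete_before(cutoff: float) -> None:
    # Deleting old data is a cheap prefix drop, located by binary search.
    i = bisect.bisect_left(timestamps, cutoff)
    del timestamps[:i], values[:i]

for t in range(10):
    append(float(t), t * 1.5)
delete_before(5.0)
print(timestamps[0])  # 5.0
```

Because both operations touch only the ends of the sorted data, the store never has to reorganize the middle, which is a large part of why these assumptions buy so much write and delete performance.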
SQL databases are typically focused on CRUD — creating, reading, updating, and deleting data, all in equal measure. In addition, they’re typically designed with the ACID principles (Atomicity, Consistency, Isolation, Durability) in mind. NoSQL, on the other hand, tends to be looser with these principles.
Each database type comes with tradeoffs optimized for its use cases. Since time series data arrives in time order and is typically collected in real time, time series databases are immutable and append-only in order to accommodate extremely high write volumes. This append-only property distinguishes time series databases from relational databases, which are optimized for transactions but accommodate lower ingest volumes. In general, depending on the use case, NoSQL databases trade the ACID principles for the BASE model (Basically Available, Soft state, Eventually consistent). This tradeoff makes sense for time series: one individual point is fairly useless in isolation; what matters is the trend in aggregate.
One of the barriers to choosing a NoSQL database is the difficulty of learning a new query language and performing complex data transformations in it. To help with this, InfluxDB provides Flux, an easy-to-learn, standalone data scripting and query language that increases productivity and code reuse. Users can easily perform data transformations and apply complicated math with intuitive syntax.
For example, to query our field, “my_data”, from our bucket, “my_bucket”, while specifying a measurement, “my_measurement”, for the last 7 days, we would use the following Flux query:

```flux
from(bucket: "my_bucket")
    |> range(start: -7d)
    |> filter(fn: (r) => r["_measurement"] == "my_measurement")
    |> filter(fn: (r) => r["_field"] == "my_data")
```
Flux enables users to interact with time series in a table format, giving InfluxDB a relational feel. With Flux, users can perform cross-measurement joins, pivots, maps, and math across measurements. InfluxDB Templates and user-defined Flux packages make the barrier to Flux adoption extremely low. InfluxDB Templates are prepackaged configurations that contain everything from Flux queries to dashboards, Telegraf configurations, notifications, and alerts, enabling users to use Flux before they even know how to write it. User-defined Flux packages allow users to write and share custom functions that wrap up complicated Flux scripts.
Even though data structures vary between NoSQL and SQL databases, almost all of the above query should feel intuitive to a SQL developer, except perhaps for the bucket. A bucket is where time series data is stored, and every bucket has a retention policy, which describes how long the database keeps data. InfluxDB compares the timestamps on your data to the server’s current time and deletes any points older than the retention period. Automatically expiring and downsampling old data is critical for time series use cases because of their high-volume nature. However, creating and scheduling these recurring tasks is challenging in a SQL environment.
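The retention check itself is simple to sketch in Python; the 7-day period and the point layout here are illustrative assumptions, not InfluxDB's implementation (the hard part in a SQL environment is scheduling and running this continuously at scale, not the comparison):

```python
import time

# Retention-policy enforcement: compare each point's timestamp to "now"
# and drop anything older than the retention period.
RETENTION_SECONDS = 7 * 24 * 3600  # illustrative 7-day retention period

def enforce_retention(points, now=None):
    """points: list of (unix_ts, value); returns only points still in retention."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    return [(ts, v) for ts, v in points if ts >= cutoff]

now = 1_000_000_000
points = [(now - 8 * 24 * 3600, 1.0),  # 8 days old: expired
          (now - 3600, 2.0)]           # 1 hour old: kept
print(enforce_retention(points, now=now))  # [(999996400, 2.0)]
```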
Specifically, SQL developers would need to write code to shard data across a cluster, build aggregation and downsampling functions, and implement data eviction, lifecycle management, and summarization. They’d also have to create an API for writing to and querying their new service. Developers using Cassandra or HBase would additionally need to write tools for data collection, introduce a real-time processing system, and write code for monitoring and alerting. Finally, they’d need to build a visualization engine to display the time series data to the user.