A time-series database (TSDB) is a database management system that is optimized to store, process, and analyze time-series data.
Time-series data is a sequence of data points that represent the changes of a measurement or events over a time period. Each data point is timestamped, and the sequence of data points is indexed or listed by timestamp. Data generated by sensors on industrial equipment, smart devices, and IT monitoring systems, as well as stock market trades, are all examples of time-series data.
It is possible to process time-series data with relational or NoSQL databases, but purpose-built time-series databases are optimized to handle the special characteristics of time-series data. This means that time-series databases are much more efficient in terms of ingestion rate, query latency, and data compression. In addition, time-series databases include special analytic functions and data management features so that you can develop applications more easily.
What Are the Characteristics of Time-Series Data?
The following characteristics are inherent to time-series data:
- Timestamp: Every data point has a timestamp. The timestamp is the key for computing or analysis.
- Structure: Unlike data from Internet applications, the metrics generated by devices or from monitoring are always structured. They have predefined data types or fixed lengths, and the structure will not change unless the device firmware is updated.
- Stream-like: Data sources generate data at a constant rate, like an audio or video stream. These data streams are independent of one another.
- Stable flow: Unlike e-commerce or social media applications, time-series data traffic is stable over time and can be calculated and predicted given the number of data sources and the sampling period.
- Immutability: A time-series data source generates each data point once, never correcting or updating existing data. Time-series data is generally append-only, similar to log data.
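The "stable flow" characteristic above means that the expected write load can be estimated with simple arithmetic. The following is a minimal sketch, assuming every data source emits exactly one data point per sampling period; the function names and the example figures are hypothetical and chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400


def expected_points_per_second(num_sources: int, sampling_period_s: float) -> float:
    """Steady-state ingestion rate, assuming one point per source per period."""
    if sampling_period_s <= 0:
        raise ValueError("sampling_period_s must be positive")
    return num_sources / sampling_period_s


def expected_points_per_day(num_sources: int, sampling_period_s: float) -> float:
    """Total points written in one day at the steady-state rate."""
    return expected_points_per_second(num_sources, sampling_period_s) * SECONDS_PER_DAY


# Hypothetical deployment: 1,000 devices, each sampled every 10 seconds.
print(expected_points_per_second(1_000, 10))  # 100.0
print(expected_points_per_day(1_000, 10))     # 8640000.0
```

Because the result depends only on the number of sources and the sampling period, capacity can be planned in advance rather than sized for unpredictable traffic spikes.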
How Is Time-Series Data Used?
Time-series data is often used to look for insights into operations, raise alerts based on real-time analysis, and forecast future trends. The following characteristics are found in time-series data applications:
- High write-read ratio: Internet applications like Twitter and LinkedIn have single posts that are read by millions of users. Time-series workloads are the opposite: data is written continuously, while raw data points are rarely read individually and are instead scanned and analyzed mainly by applications and algorithms.
- Retention policy: In general, time-series data is not stored forever. Organizations have a retention policy that defines the data lifecycle, and the data is deleted once its lifecycle is over.
- Real-time analytics and computing: To detect abnormal behavior and raise alerts based on the collected data or aggregation results, time-series data must be computed in real time.
- Query scope: Time-series data is always queried over a period of time or a set of data sources, and filters are used such that not all historical data is queried. In addition, data aggregation is always applied on all or a subset of the data sources with a filter condition.
- Trends: In time-series data, single data points are usually not important. Instead, the focus is on how data trends over a period of time, such as changes in the past hour or day.
Time-series solutions like TDengine optimize their design based on these characteristics, which enables more efficient processing of time-series data and better performance compared with general databases.
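The retention policy described above can be sketched as follows. This is a minimal illustration, not how any particular database implements it: it assumes data is grouped into daily partitions held in a plain dictionary, and the names `expired_partitions` and `apply_retention` are hypothetical.

```python
from datetime import date, timedelta


def expired_partitions(partition_days: list[date], today: date, keep_days: int) -> list[date]:
    """Return the daily partitions that fall before the retention cutoff."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(day for day in partition_days if day < cutoff)


def apply_retention(partitions: dict[date, list], today: date, keep_days: int) -> int:
    """Drop each expired partition as a whole; return how many were dropped."""
    expired = expired_partitions(list(partitions), today, keep_days)
    for day in expired:
        del partitions[day]  # one operation removes every point for that day
    return len(expired)


# Hypothetical store: three daily partitions and a 7-day retention policy.
store = {
    date(2024, 1, 1): [20.1, 20.4],
    date(2024, 1, 5): [21.0],
    date(2024, 1, 10): [19.8],
}
dropped = apply_retention(store, today=date(2024, 1, 10), keep_days=7)
print(dropped, sorted(store))
```

Grouping data by time makes expiry cheap in this sketch: an aged-out partition is discarded in one step instead of deleting data points one by one.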
Why Does Time-Series Data Require Specialized Databases?
Today, everything is online – meters, cars, elevators, assembly lines, and even bicycles are connected to the Internet, and all of them emit a relentless stream of metrics and events. With the advent of IoT and the cloud, the volume of time-series data has begun to grow exponentially. The massive size of time-series data sets is a major challenge for general database management systems like relational and NoSQL databases. In particular, the following aspects of time-series data are difficult for non-specialized databases to handle:
- Data ingestion rate: In many time-series data scenarios, millions of data points are produced every second and need to be ingested in real time. Relational databases are not designed to handle this amount of data, and while NoSQL databases can be scaled to handle it, the amount of resources required quickly becomes prohibitive.
- Query latency: Time-series applications often need to scan a huge number of data points to get an aggregation result, which can result in high latency. For example, a general database could take hours to calculate the average response time of all clicks on Amazon.com, by which time the aggregation result could be outdated.
- Storage cost: Internet-connected devices and applications are generating data nonstop 24/7 – sometimes exceeding a terabyte in a single day. Because relational and NoSQL databases cannot compress this data efficiently, storage costs can rise quickly.
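To make the storage point concrete, the sketch below shows delta encoding, one common technique for compressing regularly sampled timestamps. It is an illustrative assumption, not something the text above prescribes or attributes to a specific product, and the function names are hypothetical.

```python
def delta_encode(timestamps: list[int]) -> list[int]:
    """Keep the first timestamp, then store each difference from the previous one."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        encoded.append(curr - prev)
    return encoded


def delta_decode(encoded: list[int]) -> list[int]:
    """Rebuild the original timestamps with a running sum."""
    decoded = []
    total = 0
    for value in encoded:
        total += value
        decoded.append(total)
    return decoded


# Hypothetical sensor sampled every 10 seconds.
ts = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_030]
print(delta_encode(ts))  # [1700000000, 10, 10, 10]
```

When the sampling period is fixed, the encoded sequence is mostly one small repeated number, which takes far less space to store than full timestamps.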
These issues mainly involve efficiency in processing large data sets, but there are also areas where general databases often do not support even the basic requirements of time-series applications:
- Data lifecycle management: Once time-series data ages out, it is generally removed in batches, not one data point at a time.
- Roll-up: Time-series data is rolled up based on a specified time window and saved into a new table. In addition, raw data and rolled-up data may have different lifecycles and retention policies.
- Special analytic functions: Besides the functions provided by general databases, time-series applications need functions like time-weighted average, moving average, cumulative sum, rate of change, elapsed time for a specific state, and delta between two consecutive data points.
- Interpolation: The database management system must be able to interpolate data based on the adjacent data points and rules in order to regularize data sets when required by applications or algorithms.
- Continuous query: Time-series applications run queries in the background periodically over a sliding time window in order to populate dashboards, generate reports, and downsample data sets.
- Session and state windows: Aggregation and analytic functions may be run over a session or state window, not just time – for example, consider a function that calculates average power consumption only when a machine is in the running state.
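Two of the requirements listed above, roll-up and interpolation, can be sketched in a few lines. This is a minimal illustration under stated assumptions: data points are `(timestamp_in_seconds, value)` tuples, the roll-up uses a plain average over fixed windows, and the interpolation is linear. The function names are hypothetical.

```python
from collections import defaultdict


def roll_up_avg(points: list[tuple[int, float]], window_s: int) -> list[tuple[int, float]]:
    """Average points into fixed time windows, keyed by each window's start time."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def interpolate_linear(p0: tuple[int, float], p1: tuple[int, float], ts: int) -> float:
    """Estimate the value at ts on the straight line between two adjacent points."""
    (t0, v0), (t1, v1) = p0, p1
    if t1 == t0:
        raise ValueError("adjacent points must have different timestamps")
    return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)


# Hypothetical raw readings taken every 30 seconds, rolled up into 60-second windows.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(roll_up_avg(raw, window_s=60))                # [(0, 2.0), (60, 6.0)]
print(interpolate_linear((0, 10.0), (10, 20.0), 5)) # 15.0
```

In a real system the rolled-up rows would be written to a separate table with its own retention policy, as the roll-up item describes.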
With general databases, developers are forced to write custom code to implement these features. Different data workloads require different database solutions – one size does not fit all. For time-series data, no matter the size of your data set, a purpose-built time-series database is the best tool for the job.
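As an illustration of the kind of custom code this implies, the sketch below implements a simple state window by hand, following the power-consumption example above. It assumes samples arrive ordered by time as `(timestamp, state, power)` tuples; the function name and the sample values are hypothetical.

```python
from itertools import groupby


def state_windows(samples: list[tuple[int, str, float]]) -> list[dict]:
    """Group consecutive samples that share a state and average power per window."""
    windows = []
    for state, group in groupby(samples, key=lambda sample: sample[1]):
        rows = list(group)
        values = [power for _, _, power in rows]
        windows.append({
            "state": state,
            "start": rows[0][0],
            "end": rows[-1][0],
            "avg": sum(values) / len(values),
        })
    return windows


# Hypothetical machine readings: (timestamp, state, power in watts).
readings = [
    (0, "idle", 1.0),
    (10, "running", 50.0),
    (20, "running", 70.0),
    (30, "idle", 2.0),
    (40, "running", 60.0),
]
for window in state_windows(readings):
    if window["state"] == "running":
        print(window["start"], window["end"], window["avg"])
```

A time-series database with state-window support would express the same grouping directly in a query instead of in application code.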
Why Are Time-Series Databases (TSDB) Becoming Popular?
Time-series databases are not new: they have been widely used in the financial and process industries for decades.
However, they are becoming popular now mainly due to the rapid growth of the IoT. As more and more devices are Internet-connected and constantly sending data – time-series, of course – to the cloud, an increasing number of sectors are becoming interested in purpose-built time-series databases. As production modernizes and control systems evolve into the Industrial Internet of Things (IIoT), the industrial applications of time-series data are becoming evident as well. Finally, IT infrastructure has been steadily expanding, and everything from servers, containers, and network devices to apps and microservices is being monitored, which also generates massive amounts of time-series data.
Technologically speaking, older time-series databases are often closed systems that use outdated architectures, and they cannot scale to support the growing volume of data. In the past, a million time-series data points seemed like a huge number, but now millions and even billions of data points are nothing out of the ordinary. Furthermore, integrating legacy time-series solutions with popular data analysis tools like artificial intelligence and machine learning frameworks is difficult, if not impossible. These legacy systems cannot be moved to the cloud without significant effort, and their licensing models are no longer acceptable for modern applications.
The growing market and the limitations of older time-series databases leave space for a new generation of time-series databases. Over the past 10 years, at least 20 new time-series databases have been released on the market, with open-source time-series databases becoming particularly popular. As the following figure indicates, time-series databases are growing faster than any other segment of the database market.