Data Model Comparison Between Time-Series Databases

Jeff Tao
Jeff Tao
/
Share on LinkedIn

Time-series databases (TSDB) are aimed at processing time-stamped data points in an efficient way. The data model is the most important part of any data management system, but different TSDBs adopt different data models. To help you deploy the right TSDB for their application, this blog compares the data models of a few popular TSDBs on GitHub, specifically InfluxDB, TDengine, Timescale, Prometheus, OpenTSDB, and QuestDB.

A time-series data set contains a sequence of timestamped metrics. The metrics are always generated by a single device, sensor, or data collection agent – referred to as a data collection point (DCP). Each DCP has static attributes, like serial number, model, color, hostname, and so on. In a TSDB, these attributes are treated as labels or tags. These tags are dimensional data and can be used to filter, group, and match time series for data analytics.

To make it easier to understand, let’s use connected cars as an example. Each vehicle reports its GPS position (x, y, z coordinates) to a server periodically. Each vehicle has its own unique vehicle identification number (VIN), brand, and model.

In this time series database comparison, the data models of various systems are evaluated based on the following dimensions:

  • Type of data model – relational vs. tag set
  • Column model – single vs. multiple
  • Supported data types – numerical only vs. varied

Tag Set Data Model

While Timescale, TDengine, and QuestDB adopt a relational data model, OpenTSDB, Prometheus and InfluxDB adopt a tag set data model. Because Prometheus uses the same tag set data model as OpenTSDB with only some minor enhancements, OpenTSDB will not be addressed specifically in this section.

In a tag set data model, every time series is uniquely identified by a metric name (measurement in InfluxDB nomenclature) and a set of tags. In our connected cars scenario, each time-series is identified by GPS Coordinates + VIN + Brand + Model.

For Prometheus or OpenTSDB, a time series for connected vehicles could be written like this:

gps_x { vin=”vin1”, brand=”tesla”, model=”s3”}
gps_y { vin=”vin1”, brand=”tesla”, model=”s3”}
gps_z { vin=”vin1”, brand=”tesla”, model=”s3”}

In the case of InfluxDB,  the same time series could be written as the following:

 gps, vin=vin1, brand=tesla, model=s3  

The tag set model resembles NoSQL, where it is not required to define a schema or indices. Indices on tags are created automatically, and indices on metrics are not created at all. In fact, a major shortcoming of the tag set model is that it does not allow the creation of indices on metrics. On the other hand, writing to the database is very simple and it is quite easy to prototype simplistic applications. This makes it very attractive for developers.

To aggregate multiple time series, a tag set filter has to be applied.

If we add another tag or change the value of a tag, it is treated as a different time series. While typically tags are not changed, as IoT grows and devices become more complex and capable of capturing more data, it can be anticipated that tags will in fact be added probably quite frequently. In our simple connected vehicles scenario, if we add color as another dimension for analysis, the tag set model treats it as a new time series even though it is actually not.

Additionally, in the tag set model, tags are always treated as strings, and no other data types are supported. This makes it impossible to filter within a range, for example.  

Another drawback of the tag set data model is that the query language is always proprietary. The previous version of the InfluxDB query language was SQL-like, but the new version is powered by its proprietary Flux query language. Prometheus also has its own query language, PromQL, and OpenTSDB uses its own set of APIs for querying. Proprietary query languages impose an unnecessary learning cost which is better spent on development, especially in the fast-moving IoT world.

Relational Data Model

In applications that use a relational data model, the schema is always defined first. In the case of time-series data, there is always a timestamp column. In addition, each metric has a dedicated column and a data type. There are advantages to this: for example, the design of the schema is useful in validating user input. Requiring schema design at the beginning of the project forces developers to be thoughtful about the data model, which reduces the effort required for analysis and change management later on and also reduces risk.

One of the biggest advantages of the relational data model is the query language – SQL. SQL is both an ISO and an ANSI standard. Timescale, TDengine, and QuestDB all support SQL with some extensions. Since SQL is supported, many analytical tools, BI tools, and graphing tools can be used without any effort and queries can be reused with little or no change.

Defining a schema is viewed as hurdle by developers used to NoSQL models who may want to adopt a relational data model. To overcome this problem, both TDengine and QuestDB support schemaless data ingestion with built-in support for automatic table creation. TDengine supports both the InfluxDB line protocol and the OpenTSDB line and JSON protocols. TDengine also has out-of-the-box support for Telegraf, collectd, StatsD, and other agents which can write data directly into TDengine.

A relational data model, of course, defines a schema for collected metrics. But TDengine, Timescale, and QuestDB handle the tags associated with each data collection point in different ways.

Timescale

For Timescale, we have to create a table to store collected metrics and then convert this table into a hypertable. In addition, we also need to create another table to store the tags. If we want to execute a basic query such as aggregation over a set of time series, we need to perform a JOIN operation between the hypertable and the tags table. In our connected vehicle scenario, the following SQL statements would be executed:

CREATE TABLE vehicles(vin SERIAL PRIMARY KEY, brand VARCHAR(20), model VARCHAR(20));
CREATE TABLE vehicle_data(time TIMESTAMPTZ NOT NULL, vin SERIAL, x DOUBLE PRECISION, y DOUBLE PRECISION, z DOUBLE PRECISION, FOREIGN KEY (vin) REFERENCES vehicles (vin));

To retrieve data for a specific vehicle, the SQL statement would be:

SELECT * FROM vehicle_data WHERE vin=”vin1”;

To query all data for the Tesla Model S3, the SQL statement would be:

SELECT * FROM vehicle_data JOIN vehicles ON vehicle_data.vin= vehicles.vin WHERE brand=“tesla” AND model=”s3”;

TDengine

The TDengine Team have designed a storage engine that is highly optimized for time-series data. To enhance the overall storage and query performance, TDengine requires a table for each data collection point. Furthermore, TDengine introduces a novel concept called the supertable. A supertable is used to describe a category of devices which generate the same metrics. Its schema includes two parts – one part for metrics, and another part for tags. For our connected vehicle scenario, the supertable is created as follows:

CREATE STABLE vehicle (ts TIMESTAMP, x DOUBLE, y DOUBLE, z DOUBLE) TAGS (brand BINARY(16), model BINARY(16));

To create the table for each specific data collection point, we use the supertable as a template and assign the tag values. For example, for a Tesla Model S3 with VIN vin123, the table could be created as follows:

CREATE TABLE vin123 USING vehicle TAGS (“tesla”, “s3);

Unlike a regular table, the table for each data collection point includes a set of tags. The table name is unique in the database. In addition, we can use the VIN as the table name because one table is created for each data collection point. To query the data on a specific vehicle, the SQL statement will be:

SELECT * FROM vin123;

To run a query over a set of vehicles, we query the supertable instead of a specific table. The SQL statement is as follows:

SELECT * FROM vehicle WHERE brand = “tesla” AND model = “s3”;

By not requiring unnecessary JOIN operations, the TDengine data model enables much simpler queries when compared to queries in Timescale.

QuestDB

QuestDB puts tags and metrics together in the table definition. For the connected car scenario, the table is created as follows:

CREATE vehicle(ts TIMESTAMP, vin SYMBOL, brand SYMBOL, model SYMBOL, x DOUBLE, y DOUBLE, z DOUBLE)

If you want to run query on a set of vehicles, the SQL statement is as follows:

SELECT * FROM vehicle WHERE brand = “tesla” AND model = “s3”;

This is simple and straightforward, similar to TDengine. To query data on a specific vehicle, the SQL statement is as follows:

SELECT * FROM vehicle WHERE vin = "vin1";

Putting metrics and tags in a single table is natural for data modeling, but it poses a challenge on storage efficiency and query performance. 

Single Column vs. Multi-Column

Prometheus and OpenTSDB adopt a single column model. Each metric is treated as a different time series. In our connected vehicle case, GPS position (x, y, and z coordinates) requires three time series.

While the single column concept is easy to understand and quite flexible, it consumes significantly more resources. In our connected vehicle example, the x, y, and z coordinates in the position each require a timestamp, which doubles the data size both in RAM and on disk. Reading the GPS position requires reading from three time series, which leads to much longer latency.

InfluxDB, TDengine, Timescale and QuestDB all support a multi-column model. Typically, a time series contains multiple metrics that share the same timestamp. Multi-column support saves resources because the timestamp is stored only once. In addition, it lowers cardinality (in this case, the number of time series), thus reducing metadata size and latency.

QuestDB stores each column as a separate file and thus supports a huge number of columns in a table. Timescale currently supports only 1600 columns while TDengine supports only 4096.

In many scenarios, metrics are connected with each other. For example, in our previous connected car scenario, the x, y, and z coordinates of a GPS position are connected; for a smart power meter, current and voltage are connected; and in DevOps scenarios, inbound and outbound traffic volume are measured at the same time and CPU and memory usage are sampled at the same time. It is illogical to separate these into separate time series.   

In some scenarios, collected metrics may change often, making it difficult to define a schema at the beginning of a project. While single column is a good choice for flexibility, for the majority of scenarios such as GPS position, smart power meters, financial data, and many others, the metrics are defined very clearly. In these cases, multi-column is a better choice to save system resources and reduce latency.

For flexibility and performance, the multi-column and single column models can be used together as is done in TDengine.

Data Types

Prometheus and OpenTSDB only supports numeric data types, while TDengine, Timescale, InfluxDB, and QuestDB support numeric and many other data types.

In some use cases, such as IT infrastructure monitoring, numeric data types are generally sufficient. In most other use cases, however, such as IoT and connected cars, they are simply not enough. These scenarios require Boolean, integer, and string data types, among others.

An advantage of using different data types for different metrics is that different compression algorithms can be used to enhance the compression ratio.

Summary

For developers who are more used to NoSQL models, the tag set model is quite easy to get started with, since it does not require a schema to be defined. But this early simplicity can lead to complexity later in various ways, such as making tags more difficult. As we have seen, the query language is always proprietary and not standardized like SQL. This leads to unnecessary learning costs and lack of reusability. If an application requires more analytical features or third-party tools, the relational data model is a better choice.

We have also seen that the multi-column model can save system resources and reduce latency while the single column model is more flexible to accommodate schema changes. The best practice for a growing IoT world is to combine multi-column and single-column together in a single platform.

As for data types, the requirements are scenario-dependent; if a use case has non-numeric data, it is not feasible to use a TSDB solution that does not support strings or other types.

  • Jeff Tao

    With over three decades of hands-on experience in software development, Jeff has had the privilege of spearheading numerous ventures and initiatives in the tech realm. His passion for open source, technology, and innovation has been the driving force behind his journey.

    As one of the core developers of TDengine, he is deeply committed to pushing the boundaries of time series data platforms. His mission is crystal clear: to architect a high performance, scalable solution in this space and make it accessible, valuable and affordable for everyone, from individual developers and startups to industry giants.