In the Industrial IoT world, the value of a dataset increases with its size: the more data that your applications and algorithms can access, the more accurate their results can be and the more your organization can benefit. But the massive datasets that IIoT devices generate have to be stored somewhere, and the cost of that storage can be astronomical for larger enterprises.
Fortunately, the time-series data typically found in IIoT scenarios is particularly suitable for compression. TDengine can compress datasets to a fraction of their original sizes — sometimes as low as 10% — without significantly affecting processing performance. This is due to several factors.
Column-Oriented Storage
Time-series databases store metrics that devices generate over time. These metrics are typically collected at short intervals, in most cases between once per second and once per minute, so the value of each data point is usually very similar to the values around it. Consider a temperature sensor on a device that takes a reading every 15 seconds. Drastic changes in temperature are unlikely in such a short interval, and in the absence of anomalies the overall range of temperatures stays relatively narrow.
Unlike general-purpose databases like MySQL that use row-oriented storage, TDengine and other time-series databases store data by column. In this way, similar data — recorded measurements and timestamps over a period of time — are stored contiguously. Because this similar data is stored together, it is highly compressible even without advanced compression algorithms.
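To make the layout difference concrete, here is a minimal Python sketch. The readings, the variable names, and the "mean gap between neighbouring values" measure are all assumptions chosen for illustration; this is not how TDengine lays out or measures data internally. It only shows that grouping values by column puts similar numbers next to each other.

```python
# Hypothetical readings: (Unix timestamp in seconds, temperature in tenths of a degree).
rows = [(1_700_000_000 + 15 * i, 215 + (i % 4)) for i in range(8)]

# Row-oriented layout: the fields of each record are stored side by side.
row_stream = [value for row in rows for value in row]

# Column-oriented layout: each column is stored as its own contiguous run.
timestamps, temperatures = (list(column) for column in zip(*rows))


def mean_adjacent_gap(stream):
    """Average absolute difference between neighbouring stored values."""
    gaps = [abs(b - a) for a, b in zip(stream, stream[1:])]
    return sum(gaps) / len(gaps)


print(f"row layout, mean gap:         {mean_adjacent_gap(row_stream):,.1f}")
print(f"timestamp column, mean gap:   {mean_adjacent_gap(timestamps):,.1f}")
print(f"temperature column, mean gap: {mean_adjacent_gap(temperatures):,.1f}")
```

In the row layout every value sits next to a value of a different kind, so neighbours differ by roughly the size of a timestamp. Within each column, neighbours differ only by the sampling interval or by a small temperature change, which is the kind of regularity that encoders and compressors exploit.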
Two-Stage Compression
TDengine uses two-stage compression, in which data is first encoded and then compressed. By default, TDengine encodes data as follows:
- Smaller integer data types (INT, TINYINT, etc.) use simple8b encoding.
- Larger integer data types (BIGINT and UBIGINT) and timestamps use delta encoding.
- Floating-point data types (FLOAT and DOUBLE) use delta-of-delta encoding.
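For readers unfamiliar with simple8b, the idea is to pack several small integers into one 64-bit word: 4 bits select a layout, and the remaining 60 bits hold the values. The Python sketch below follows the commonly described Simple8b selector table but leaves out the two run-length selectors; the function names are hypothetical, and TDengine's own implementation may differ in its details.

```python
# selector -> (values per word, bits per value), ordered from most to fewest values.
SELECTORS = {
    2: (60, 1), 3: (30, 2), 4: (20, 3), 5: (15, 4), 6: (12, 5), 7: (10, 6),
    8: (8, 7), 9: (7, 8), 10: (6, 10), 11: (5, 12), 12: (4, 15),
    13: (3, 20), 14: (2, 30), 15: (1, 60),
}


def pack(values):
    """Greedily pack non-negative integers into 64-bit words."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in SELECTORS.items():
            chunk = values[i:i + count]
            # Use this layout only if a full chunk is available and every value fits.
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 60
                for slot, v in enumerate(chunk):
                    word |= v << (slot * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("values must be in the range 0 .. 2**60 - 1")
    return words


def unpack(words):
    """Reverse pack(): read the selector, then extract each fixed-width value."""
    values = []
    for word in words:
        count, bits = SELECTORS[word >> 60]
        mask = (1 << bits) - 1
        values.extend((word >> (slot * bits)) & mask for slot in range(count))
    return values


# Hypothetical small readings: 15 values that each fit in 4 bits.
readings = [3, 7, 1, 0, 5, 2, 6, 4, 1, 3, 7, 2, 0, 5, 6]
words = pack(readings)
assert unpack(words) == readings
print(f"{len(readings)} values packed into {len(words)} 64-bit word(s)")
```

Stored as 4-byte INT values, these 15 readings would take 60 bytes; packed this way they fit in a single 8-byte word. The smaller the values, the more of them fit per word, which is why this style of encoding suits the smaller integer types.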
Delta encoding is highly efficient in time-series scenarios because adjacent data points are typically very similar. For example, when data is generated every second, the difference (delta) between consecutive timestamps is naturally 1 second. This difference is much smaller than the full values in the timestamp column, so it can be stored in far fewer bits.
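The timestamp example can be sketched in a few lines of Python. The function names and the sample timestamps are assumptions for illustration, and the code shows plain delta encoding only, not TDengine's internal format.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    return list(accumulate(encoded))


# Hypothetical millisecond timestamps, one reading per second.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(6)]
deltas = delta_encode(timestamps)

print(deltas)  # [1700000000000, 1000, 1000, 1000, 1000, 1000]
assert delta_decode(deltas) == timestamps

# Bits needed for a full timestamp versus a single delta.
print(timestamps[-1].bit_length(), max(d.bit_length() for d in deltas[1:]))  # 41 10
```

Only the first value has to be stored in full. Every later entry is the same small number, which needs about a quarter of the bits and repeats exactly, so a general-purpose compressor can shrink it further.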
After encoding, all data types are compressed with standard data compression methods. TDengine uses lz4 for all columns by default, but you can configure specific columns to be compressed using zlib, zstd, or xz if desired. Depending on the nature of your data, you may find better results with a certain algorithm. In addition, you can configure lossy compression (tsz) on floating-point columns for additional savings.
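The effect of combining the two stages can be approximated with Python's standard library. This is a rough sketch under stated assumptions: the timestamp column is made up, the helper names are hypothetical, and `zlib` and `lzma` (xz) stand in for the compressors because lz4 and zstd are not part of the standard library. It does not reproduce TDengine's storage format, and the lossy tsz option is not shown.

```python
import lzma
import struct
import zlib


def delta_encode(values):
    """Stage 1: keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def pack_int64(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


# Hypothetical column: 10,000 millisecond timestamps, one per second.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(10_000)]

raw = pack_int64(timestamps)
encoded = pack_int64(delta_encode(timestamps))

# Stage 2: general-purpose compression, with and without the encoding stage.
sizes = {
    "zlib, no encoding": len(zlib.compress(raw)),
    "delta + zlib": len(zlib.compress(encoded)),
    "xz, no encoding": len(lzma.compress(raw)),
    "delta + xz": len(lzma.compress(encoded)),
}
for label, size in sizes.items():
    print(f"{label:>18}: {size:>6} bytes (uncompressed: {len(raw)})")
```

The exact sizes depend on the data and the library versions, but for a regular series like this one the encoded column compresses far better than the unencoded one with either algorithm, because the encoding stage turns a stream of distinct values into a stream of repeated ones.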
One Table per Device
TDengine is unique among time-series databases in that it uses a “one table per device” model, in which a separate table is created for each data collection point. This model enables TDengine to offer even higher compression rates than other TSDBs, because each block of data contains only the records for a single table (that is, a single device). Consider two devices, one recording temperatures in degrees Fahrenheit and the other in degrees Celsius. If the records from these two devices are stored in the same table, so that each data block contains records from both devices, the overall compressibility of the data is reduced: the values fall in different ranges, so adjacent records are dissimilar. By associating each device with an independent table, TDengine ensures the highest possible compression ratios for the data generated by all devices.
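The Fahrenheit and Celsius example can be illustrated with a short Python sketch. The readings and helper names are hypothetical, and "bits needed for the largest delta" is only a rough proxy for compressibility, not a measure TDengine reports.

```python
def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def max_delta_bits(values):
    """Bits needed for the magnitude of the largest delta after the first value."""
    return max(abs(d).bit_length() for d in delta_encode(values)[1:])


# Hypothetical readings in tenths of a degree.
celsius_device = [215, 216, 216, 217, 216, 215]     # about 21.5 degrees Celsius
fahrenheit_device = [707, 708, 709, 709, 710, 708]  # about 70.7 degrees Fahrenheit

# Shared table: records from both devices interleaved in time order within one block.
shared_block = [v for pair in zip(celsius_device, fahrenheit_device) for v in pair]

print("shared block:     ", max_delta_bits(shared_block), "bits per delta")
print("Celsius device:   ", max_delta_bits(celsius_device), "bits per delta")
print("Fahrenheit device:", max_delta_bits(fahrenheit_device), "bits per delta")
```

In the shared block, each delta jumps between the two temperature scales and needs several times as many bits as the deltas within either device's own series. Keeping each device in its own table preserves the small, regular differences that the encoding stage relies on.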