TDengine 3.0 Storage Engine

Hongze Cheng

Compared with the previous two versions, the 3.0 storage engine focuses more on storage and query efficiency across a wide range of scenarios. It not only further reduces the burden on management nodes, provides efficient and well-designed update and delete functions, and supports data backup, stream processing, and other features, but also takes into account efficient processing as data dimensions expand, startup speed in scenarios with very large numbers of tables, and reasonable, efficient, and precise use of system resources.

The updates to the TDengine 3.0 storage engine can be divided into three main parts: TQ, the WAL-based message queue; META, the TDB-based metadata storage engine; and TSDB, the LSM-like storage engine used to store time-series data (TSDB SE).

Message Queue Storage Engine

In 1.0 and 2.0 we provided a continuous query feature built on TSDB storage, the idea being to use it in place of stream processing. Its workflow can be summarized as follows: the application sends down a query task at regular intervals, the query engine executes the query, TSDB returns the result to the query engine, and the query engine returns the result to the application.

The advantage of this approach is that the query engine can be reused, so it is simple and easy to develop. The disadvantages are also obvious: real-time performance is poor, the periodic queries create load that wastes computing power, and, more importantly, out-of-order data cannot be handled. Data is sorted by timestamp once it enters TSDB, so when an out-of-order record arrives it is inserted in front of data that has already been processed, and TSDB has no way to push that record out again; in addition, because the query engine is simply reused, it does not reprocess the newly arrived out-of-order data and correct results that have already been returned.

Against this technical background, TDengine 3.0 needed a storage engine designed to support message queues and stream processing. Such a storage engine can tell us which data is incremental, so that stream processing only needs to handle the incremental data and can leave the rest alone.

In addition, the storage engine needs to be built on top of a Pipe to ensure consistent data entry and exit order. During the design, we found that TDengine’s WAL is actually a natural Pipe, so we added a layer of indexes on top of the WAL and did a lot of adaptation development to implement the TQ storage engine.
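As a rough illustration of this design, here is a minimal C sketch, not TDengine's actual code: an append-only log file plays the role of the WAL, and a small in-memory index maps each entry's version to its file offset so that a consumer can resume reading from any position. The names (SLogIndex, logAppend, logReadFrom) are made up for the example.

```c
/* Minimal sketch of the idea behind TQ: an append-only log (the "WAL")
 * plus an in-memory index mapping entry version -> file offset, so a
 * consumer can resume from any version. Illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 1024

typedef struct {
  int64_t version;  /* monotonically increasing entry version */
  long    offset;   /* byte offset of the entry in the log file */
} SLogIndex;

static SLogIndex idx[MAX_ENTRIES];
static int       nIdx = 0;

/* Append one payload to the log and record its offset in the index. */
static void logAppend(FILE *fp, int64_t version, const char *payload) {
  long    offset = ftell(fp);
  int32_t len = (int32_t)strlen(payload);
  fwrite(&version, sizeof(version), 1, fp);
  fwrite(&len, sizeof(len), 1, fp);
  fwrite(payload, 1, len, fp);
  idx[nIdx].version = version;
  idx[nIdx].offset  = offset;
  nIdx++;
}

/* Read back every entry whose version is >= since. */
static void logReadFrom(FILE *fp, int64_t since) {
  for (int i = 0; i < nIdx; i++) {
    if (idx[i].version < since) continue;
    fseek(fp, idx[i].offset, SEEK_SET);
    int64_t version; int32_t len; char buf[256] = {0};
    fread(&version, sizeof(version), 1, fp);
    fread(&len, sizeof(len), 1, fp);
    fread(buf, 1, len, fp);
    printf("version %lld: %s\n", (long long)version, buf);
  }
}

int main(void) {
  FILE *fp = tmpfile();
  logAppend(fp, 1, "d1001,ts=1000,current=10.3");
  logAppend(fp, 2, "d1002,ts=1000,current=11.8");
  logAppend(fp, 3, "d1001,ts=2000,current=10.5");
  fflush(fp);
  logReadFrom(fp, 2);  /* a consumer resumes from version 2 */
  fclose(fp);
  return 0;
}
```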

If you have studied TDengine's data model, you will find that its architecture corresponds closely to many of Kafka's designs: the supertable is similar to Kafka's Topic, the vnode is close to Kafka's Partition, and the table name of a sub-table corresponds to Kafka's Event Key. The architecture therefore naturally has the characteristics of a message queue, and from this point of view it was very easy for TDengine 3.0 to implement a message queue.

In practice, based on the TQ storage engine, the query engine processes only the incremental data, corrects the computation results, and returns them to the application; it does not re-query the full data set. This brings very high real-time performance: incremental data can be clearly identified, out-of-order data can be processed efficiently, and computing resources are saved because only corrections to the results need to be computed.
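To see why this saves work, consider the hedged sketch below: a running aggregate state is adjusted using only newly arrived (possibly out-of-order) rows instead of rescanning the full data set. The SAvgState struct and function names are hypothetical, not TDengine's stream-processing API.

```c
/* Hedged sketch of result correction with incremental data: the running
 * aggregate is adjusted using only the rows that just arrived. */
#include <stdio.h>

typedef struct { long long count; double sum; } SAvgState;

static void avgStateUpdate(SAvgState *st, const double *rows, int n) {
  for (int i = 0; i < n; i++) { st->count++; st->sum += rows[i]; }
}

int main(void) {
  SAvgState st = {0};
  double batch1[] = {10.3, 11.8, 10.5};
  avgStateUpdate(&st, batch1, 3);          /* initial result */
  printf("avg = %.3f\n", st.sum / st.count);

  double late[] = {9.9};                   /* a late, out-of-order row */
  avgStateUpdate(&st, late, 1);            /* correct, don't recompute */
  printf("avg = %.3f\n", st.sum / st.count);
  return 0;
}
```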

Metadata Storage Engine

For metadata storage, 1.0 and 2.0 adopted a relatively simple mechanism: full in-memory storage. Metadata is kept in memory in hash tables, supplemented by skip-list indexes, and the hash table is backed by a persistent storage engine to ensure durability. The advantage of this approach is that everything is in memory and therefore very fast; the disadvantages are equally obvious: at startup all of this data must be loaded into memory, so memory usage cannot be precisely controlled and startup takes a long time.

To solve these problems, in 3.0 we developed TDB, a B+ tree storage engine, to store metadata and metadata indexes. TDB's B+ tree storage is well suited to metadata read and write patterns and can support storage for ten billion timelines, avoiding full in-memory storage and long metadata loading times and solving the problem of table count growth under limited memory. If you are interested in how TDB is implemented, you can take a look at the source code on GitHub (https://github.com/taosdata/TDengine).

The advantages of TDB are that memory usage can be precisely controlled, startup is fast, and a large amount of metadata can be stored with limited memory.

Time-Series Data Storage Engine

Update and deletion of time-series data

In 2.0, the update and delete functions were added after the engine had been developed, so they are relatively simple but limited. Updates in 2.0 are based on the distribution of timestamps along the horizontal axis: updating a record means appending new data with the same timestamp after the existing data, and the query engine then merges these out-of-order records to obtain the updated result. The implementation of delete is even simpler and is close to physical deletion: the data to be deleted is directly removed in memory and on disk, which is relatively inefficient.

TDengine 3.0 completely abandons the 2.0 update and delete mechanism and builds updates and deletes into the design itself by introducing version numbers, which turn time-series data into points on a two-dimensional graph: each write request carries a version number, and version numbers are incremented in the order in which write requests are processed.

So how exactly does 3.0 handle updates? As shown above, the blue dots are the data to be updated, and new data can only replace old data if it carries a larger version number; this is why the version number mechanism was introduced. When timestamps are the same, the data with the larger version number overrides the data with the smaller version number, because data with a larger version number was written later and is therefore "newer".

Previously the data in each table, whether in memory or on disk, was sorted only by timestamp; with the introduction of version numbers the sorting rules change. Data is still sorted by timestamp first, and rows with the same timestamp are then sorted by version number. With this ordering, updates can be handled in a near-append fashion, and the query engine is responsible for merging and sorting the data to produce the final result.
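A minimal sketch of this ordering and merge, with assumed field names rather than TDengine's actual structures: rows are sorted by (timestamp, version), and when timestamps collide the row with the larger version is taken as the current value.

```c
/* Sketch of the 3.0 update rule: order rows by (timestamp, version); for
 * equal timestamps the row with the larger version is the updated value. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct {
  int64_t ts;       /* row timestamp */
  int64_t version;  /* version of the write request that produced the row */
  double  val;
} SRow;

static int rowCompare(const void *pa, const void *pb) {
  const SRow *a = pa, *b = pb;
  if (a->ts != b->ts) return (a->ts < b->ts) ? -1 : 1;
  if (a->version != b->version) return (a->version < b->version) ? -1 : 1;
  return 0;
}

int main(void) {
  /* writes arrive in version order; ts=2000 is written twice (an update) */
  SRow rows[] = {
      {1000, 1, 10.3}, {2000, 2, 11.8}, {3000, 3, 10.5}, {2000, 4, 12.1},
  };
  int n = sizeof(rows) / sizeof(rows[0]);

  qsort(rows, n, sizeof(SRow), rowCompare);  /* near-append, then sort */

  /* merge: for equal timestamps, the last row after sorting carries the
   * largest version, i.e. the updated value */
  for (int i = 0; i < n; i++) {
    if (i + 1 < n && rows[i + 1].ts == rows[i].ts) continue;
    printf("ts=%lld val=%.1f (version %lld)\n", (long long)rows[i].ts,
           rows[i].val, (long long)rows[i].version);
  }
  return 0;
}
```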

In 3.0, the deletion mechanism for time-series data has also been completely redone. Compared to 2.0, 3.0 supports significantly more filter conditions, such as filtering by tag, by timestamp, and so on. How is this implemented underneath? Again, it is based on the version number mechanism.

For a delete operation, we record the start and end of the time interval as well as the version number of the delete request. As shown above, a delete request corresponds to a purple rectangle on the two-dimensional graph, and all points inside this rectangle are deleted. In 3.0, deleting time-series data simply appends a (st, et, version) tuple as a record, and at query time the query engine merges the written data with these delete-record tuples to obtain the final result after deletion.
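The hedged sketch below shows how such a check might look at merge time, using illustrative record layouts: a row counts as deleted if some (st, et, version) tuple covers its timestamp and carries a version number larger than the row's own.

```c
/* Hedged sketch of the (st, et, version) delete record and the naive
 * per-row check performed when merging data at query time. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { int64_t st, et, version; } SDelRecord;
typedef struct { int64_t ts, version; double val; } SRow;

/* A row is deleted if a delete tuple covers its timestamp and has a
 * larger version (i.e. the delete happened after the write). */
static bool rowIsDeleted(const SRow *row, const SDelRecord *del, int nDel) {
  for (int i = 0; i < nDel; i++) {
    if (row->ts >= del[i].st && row->ts <= del[i].et &&
        row->version < del[i].version) {
      return true;
    }
  }
  return false;
}

int main(void) {
  SRow rows[] = {{1000, 1, 10.3}, {2000, 2, 11.8}, {3000, 5, 10.5}};
  /* "delete ... where ts between 1500 and 3500", processed as version 4 */
  SDelRecord dels[] = {{1500, 3500, 4}};

  for (int i = 0; i < 3; i++) {
    if (rowIsDeleted(&rows[i], dels, 1)) continue;  /* filtered out */
    printf("ts=%lld val=%.1f\n", (long long)rows[i].ts, rows[i].val);
  }
  return 0;
}
```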

With this mechanism, deletion becomes relatively simple on the write path but more complex on the query path. When merging data, it is quite time-consuming for a query to determine whether a record has been deleted, that is, to check whether the record falls inside any of the delete intervals (rectangles).

TDengine 3.0 uses the Skyline algorithm to speed up queries over data with deletions. The essence of the algorithm is to construct a set of points that replaces all of the delete intervals; in the figure above, for example, the two rectangular delete areas can be represented by three points. The original per-row check against every delete interval then becomes a single forward scan and filter, which greatly improves query speed.
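The sketch below illustrates the idea under assumed data layouts and is not TDengine's actual implementation: the delete rectangles are flattened into a sorted list of points, each recording the delete version in effect from its timestamp until the next point, after which a single forward pass over time-ordered rows decides which rows survive.

```c
/* Hedged sketch of the skyline idea: flatten delete rectangles into
 * (ts, version) points, then filter rows in one forward scan. */
#include <stdio.h>
#include <stdint.h>

typedef struct { int64_t st, et, version; } SDelRecord;
typedef struct { int64_t ts, version; } SSkylinePoint;
typedef struct { int64_t ts, version; double val; } SRow;

/* Build skyline points from delete intervals (O(n^2), fine for a sketch):
 * at each boundary, the effective delete version is the max version of
 * all intervals covering that timestamp. */
static int buildSkyline(const SDelRecord *del, int nDel, SSkylinePoint *out) {
  int64_t bounds[16];
  int nb = 0;
  for (int i = 0; i < nDel; i++) {
    bounds[nb++] = del[i].st;
    bounds[nb++] = del[i].et + 1;  /* intervals are inclusive */
  }
  for (int i = 1; i < nb; i++)     /* insertion sort of the boundaries */
    for (int j = i; j > 0 && bounds[j] < bounds[j - 1]; j--) {
      int64_t t = bounds[j]; bounds[j] = bounds[j - 1]; bounds[j - 1] = t;
    }
  int np = 0;
  for (int i = 0; i < nb; i++) {
    if (i > 0 && bounds[i] == bounds[i - 1]) continue;   /* dedup */
    int64_t ver = 0;
    for (int j = 0; j < nDel; j++)
      if (bounds[i] >= del[j].st && bounds[i] <= del[j].et &&
          del[j].version > ver)
        ver = del[j].version;
    if (np > 0 && out[np - 1].version == ver) continue;  /* merge equal runs */
    out[np].ts = bounds[i];
    out[np].version = ver;
    np++;
  }
  return np;
}

int main(void) {
  SDelRecord dels[] = {{1000, 2500, 4}, {2000, 4000, 6}};
  SSkylinePoint sky[16];
  int np = buildSkyline(dels, 2, sky);

  SRow rows[] = {{500, 1, 9.7}, {1500, 2, 10.3}, {3000, 7, 11.8}, {4500, 3, 10.5}};
  int n = 4, k = 0;
  int64_t curVer = 0;  /* delete version currently in effect */

  /* single forward scan: rows and skyline points are both time-ordered */
  for (int i = 0; i < n; i++) {
    while (k < np && sky[k].ts <= rows[i].ts) curVer = sky[k++].version;
    if (rows[i].version < curVer) continue;  /* deleted */
    printf("ts=%lld val=%.1f\n", (long long)rows[i].ts, rows[i].val);
  }
  return 0;
}
```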

Storage optimization in multi-table scenarios

In time-series scenarios, massive amounts of data can also mean a massive number of tables. In some businesses the number of tables is very large but very little data is collected per table; for example, there may be 10 million tables with only two records collected per table per day. This kind of scenario is not handled well by the 2.0 storage structure.

In TDengine 2.0, when a table is flushed to disk, a data block contains data from only one table. With 10 million tables and two records per table, each data block holds only two records, which cannot be compressed effectively; compressing such blocks can even lead to data bloat.

In TDengine, all sub-tables under a super-table share the same schema, so in the last file we can merge data from different sub-tables of the same super-table into a single data block. Where a single table contributed only two records, merging two records from each of a hundred tables yields a block of two hundred records. Reading that data back table by table, however, could require reading the block a hundred times, so how do we arrange to read it only once?

In the last file of 3.0, the data in each data block is sorted by (UID, Timestamp, Version): rows are compared by UID first, by timestamp when the UIDs are equal, and by version number when the timestamps are also equal. With this ordering, multi-table low-frequency scenarios can be handled much more efficiently.
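A minimal sketch of this ordering with illustrative field names: rows from several low-frequency sub-tables are packed into one block and sorted by (UID, timestamp, version), so each table's rows end up contiguous and a single sequential read covers all of them.

```c
/* Sketch of the last-file ordering: one block holds rows from several
 * sub-tables of a super-table, sorted by (uid, timestamp, version). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct {
  int64_t uid;      /* sub-table UID */
  int64_t ts;       /* row timestamp */
  int64_t version;  /* write version */
  double  val;
} SLastRow;

static int lastRowCompare(const void *pa, const void *pb) {
  const SLastRow *a = pa, *b = pb;
  if (a->uid != b->uid) return (a->uid < b->uid) ? -1 : 1;
  if (a->ts != b->ts) return (a->ts < b->ts) ? -1 : 1;
  if (a->version != b->version) return (a->version < b->version) ? -1 : 1;
  return 0;
}

int main(void) {
  /* two rows each from three low-frequency sub-tables, arriving interleaved */
  SLastRow rows[] = {
      {1003, 1000, 1, 10.3}, {1001, 1000, 2, 11.8}, {1002, 1000, 3, 9.7},
      {1001, 2000, 4, 12.1}, {1003, 2000, 5, 10.5}, {1002, 2000, 6, 9.9},
  };
  int n = sizeof(rows) / sizeof(rows[0]);

  qsort(rows, n, sizeof(SLastRow), lastRowCompare);  /* one merged block */

  for (int i = 0; i < n; i++)
    printf("uid=%lld ts=%lld val=%.1f\n", (long long)rows[i].uid,
           (long long)rows[i].ts, rows[i].val);
  return 0;
}
```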

Author

  • Hongze Cheng

    Hongze Cheng is Co-Founder and Storage Engine Architect at TDengine and is responsible for designing the system structure of TDengine 3.0 and implementing the storage engine. Additionally, he is in charge of the design of TDengine’s next-generation storage. He joined TDengine after graduating from the University of Michigan with two master's degrees.