TDengine Concepts: Schemaless Writing

Mark Wang

In IoT applications, data is collected for many purposes, such as intelligent control, business analysis, and device monitoring, and stored in the time-series database (TSDB). Due to changes in business or functional requirements or changes in device hardware, the application logic and even the data collected may change. Schemaless writing automatically creates storage structures for your data as it is being written to TDengine, so that you do not need to create supertables in advance. When necessary, schemaless writing automatically adds the required columns to ensure that the data written by the user is stored correctly.

The schemaless writing method creates supertables and their corresponding subtables. These are indistinguishable from the supertables and subtables created by SQL, and you can write data directly to them with SQL statements. Note that the names of tables created by schemaless writing are generated from tag values by fixed mapping rules, so they are not descriptive and are hard to read.

Schemaless Writing Line Protocol

TDengine supports the InfluxDB line protocol, the OpenTSDB telnet line protocol, and the OpenTSDB JSON format protocol. When using any of these three protocols, you must specify in the API which parsing protocol applies to the input content.

The following is a description of the TDengine extended protocol, based on InfluxDB’s line protocol. This allows users to control the supertable schema more granularly.

With the following formatting conventions, schemaless writing uses a single string to express a data row (multiple rows can be passed into the writing API at once to enable bulk writing).

measurement,tag_set field_set timestamp

All data in the tag_set field is automatically converted to the NCHAR data type. The data type for items in the field_set field is determined by prefixes and suffixes.
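
As a rough illustration of the row layout above, the following Python sketch splits one row into its four parts. The function name parse_line is hypothetical, and the sketch assumes that no value contains a space, comma, or equals sign, so escaping is ignored. It is not TDengine's parser.

```python
def parse_line(line):
    """Split one row into (measurement, tags, fields, timestamp).

    Simplifying assumption: values contain no spaces, commas, or equals
    signs, so escaping rules are not handled here.
    """
    # The three space-separated parts: "measurement,tag_set", "field_set", "timestamp"
    head, field_part, ts_part = line.split(" ")
    measurement, *tag_items = head.split(",")
    # Tag values stay as strings, mirroring the NCHAR conversion described above.
    tags = dict(item.split("=", 1) for item in tag_items)
    # Field values are kept raw; their suffixes or quotes decide the type later.
    fields = dict(item.split("=", 1) for item in field_part.split(","))
    return measurement, tags, fields, int(ts_part)
```

The raw field strings (for example 3i64 or "passit") are returned unchanged so that type detection can be handled as a separate step.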

Main Processing Logic

Subtable names are generated by the following rules. First, the measurement name and the keys and values of the tags are combined into the following string:

"measurement,tag_key1=tag_value1,tag_key2=tag_value2"

Note that the tags in this string are not in the order in which the user entered them. They are sorted by tag name in ascending string order, so tag_key1 is not necessarily the first tag entered in the line protocol. After sorting, the MD5 hash value of the string, “md5_val”, is calculated and combined with a fixed prefix to generate the table name “t_md5_val”. Every table generated by this mapping has the fixed prefix “t_”.

To specify table names yourself, configure smlChildTableName in taos.cfg, for example, smlChildTableName=tname. With this setting, if you insert st,tname=cpu1,t1=4 c1=3 1626006833639000000, the cpu1 table is created automatically. Note that if multiple rows have the same tname but different tag_set values, the tag_set of the first row is used to create the table and the others are ignored.
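
The naming rule can be sketched in Python as follows. The function name subtable_name and its child_table_key parameter are hypothetical, and the exact byte string that TDengine hashes internally is an assumption here, so the generated names may not match those of real tables.

```python
import hashlib


def subtable_name(measurement, tags, child_table_key=None):
    """Derive a subtable name from the measurement and tag set.

    child_table_key plays the role of smlChildTableName: if that tag is
    present in the row, its value is used directly as the table name.
    """
    if child_table_key is not None and child_table_key in tags:
        return tags[child_table_key]
    # Sort tags by name in ascending string order, not input order.
    ordered = sorted(tags.items())
    key = measurement + "".join(f",{name}={value}" for name, value in ordered)
    # Assumption: the string is hashed as UTF-8 and rendered as lowercase hex.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return "t_" + digest
```

Because the tags are sorted before hashing, two rows that list the same tags in a different order map to the same subtable.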

  • If the supertable obtained by parsing the line protocol does not exist, it is created.
  • If the subtable obtained by parsing the line protocol does not exist, schemaless writing creates it with the subtable name determined by the naming rules above.
  • If a tag or regular column specified in the data row does not exist, it is added to the supertable (columns are only added, never removed).
  • If the supertable contains tag columns or regular columns for which a data row specifies no value, the values of these columns are set to NULL for that row.
  • For BINARY or NCHAR columns, if the length of the value provided in a data row exceeds the column type limit, the maximum length of characters allowed to be stored in the column is automatically increased (only incremented and not decremented) to ensure complete preservation of the data.
  • Errors encountered throughout the processing will interrupt the writing process and return an error code.
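
The add-only column rule and the NULL rule from the list above can be illustrated with a small in-memory model. This is a sketch only: apply_row is a hypothetical helper, the supertable is reduced to a list of column names, and None stands in for NULL.

```python
def apply_row(columns, row_fields):
    """Align one row with the supertable columns.

    columns: list of column names in the supertable; it is extended in
    place when the row carries a column that does not exist yet.
    row_fields: dict mapping column name to value for this row.
    Returns the row values in column order, with None for missing columns.
    """
    for name in row_fields:
        if name not in columns:
            columns.append(name)  # the schema only grows, it never shrinks
    return [row_fields.get(name) for name in columns]
```

For example, starting from columns = ["c1"], a row with c1 and c6 extends the list to ["c1", "c6"], and a later row that supplies only c1 is stored with None for c6.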

It is assumed that the order of field_set in a supertable is consistent, meaning that the first record contains all fields and subsequent records store fields in the same order. If the order is not consistent, set smlDataFormat in taos.cfg to false. Otherwise, data will be written out of order and a database error will occur. (smlDataFormat defaults to false after version 3.0.1.3.)

Time Resolution Recognition

Schemaless writing supports three protocol modes, as follows:

No.  Value                 Description
1    SML_LINE_PROTOCOL     InfluxDB line protocol
2    SML_TELNET_PROTOCOL   OpenTSDB telnet line protocol
3    SML_JSON_PROTOCOL     OpenTSDB JSON protocol

In InfluxDB line protocol mode, you must specify the precision of the input timestamp. Valid precisions are described in the following table.

No.  Precision                           Description
1    TSDB_SML_TIMESTAMP_NOT_CONFIGURED   Not defined (invalid)
2    TSDB_SML_TIMESTAMP_HOURS            Hours
3    TSDB_SML_TIMESTAMP_MINUTES          Minutes
4    TSDB_SML_TIMESTAMP_SECONDS          Seconds
5    TSDB_SML_TIMESTAMP_MILLI_SECONDS    Milliseconds
6    TSDB_SML_TIMESTAMP_MICRO_SECONDS    Microseconds
7    TSDB_SML_TIMESTAMP_NANO_SECONDS     Nanoseconds
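
To make the effect of the precision setting concrete, the sketch below interprets a raw timestamp under each precision and converts it to nanoseconds. The dictionary and the to_nanoseconds function are hypothetical helpers written for illustration; only the precision names come from the table above.

```python
# Nanoseconds per unit for each valid precision name.
NANOS_PER_UNIT = {
    "TSDB_SML_TIMESTAMP_HOURS": 3_600_000_000_000,
    "TSDB_SML_TIMESTAMP_MINUTES": 60_000_000_000,
    "TSDB_SML_TIMESTAMP_SECONDS": 1_000_000_000,
    "TSDB_SML_TIMESTAMP_MILLI_SECONDS": 1_000_000,
    "TSDB_SML_TIMESTAMP_MICRO_SECONDS": 1_000,
    "TSDB_SML_TIMESTAMP_NANO_SECONDS": 1,
}


def to_nanoseconds(value, precision):
    """Convert an integer timestamp in the given precision to nanoseconds."""
    if precision not in NANOS_PER_UNIT:
        # TSDB_SML_TIMESTAMP_NOT_CONFIGURED is listed as invalid in this mode.
        raise ValueError(f"invalid precision: {precision}")
    return value * NANOS_PER_UNIT[precision]
```

The same integer therefore denotes very different instants depending on the precision passed to the API, which is why the setting is mandatory in InfluxDB line protocol mode.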

In the OpenTSDB telnet and JSON protocol modes, the precision of the timestamp is determined from its length in the standard OpenTSDB manner, and the precision specified by the user is ignored.

Data Model Mapping

This section describes how data in line protocol is mapped to a schema. The data measurement in each line is mapped to a supertable name. The tag name in tag_set is the tag name in the schema, and the name in field_set is the column name in the schema. The following example shows how data is mapped:

st,t1=3,t2=4,t3=t3 c1=3i64,c3="passit",c2=false,c4=4f64 1626006833639000000

This row is mapped to the supertable st, which contains three NCHAR tags: t1, t2, and t3. Five columns are created: _ts (timestamp), c1 (bigint), c3 (binary), c2 (bool), and c4 (double, as indicated by the f64 suffix). The following SQL statement is generated:

create stable st (_ts timestamp, c1 bigint, c2 bool, c3 binary(6), c4 double) tags(t1 nchar(1), t2 nchar(1), t3 nchar(2))
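
The mapping can be approximated with a short Python sketch. The functions infer_field_type and create_stable_sql are hypothetical, and they cover only the value forms that appear in this article (i64 and i for bigint, f64 or no suffix for double, quoted strings for binary, and true/false for bool). Other suffixes that TDengine may accept are not handled.

```python
def infer_field_type(raw):
    """Guess the column type from a raw field value in line protocol."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        # Quoted string: binary sized to the byte length of the content.
        return f"binary({len(raw[1:-1].encode('utf-8'))})"
    if raw in ("true", "false"):
        return "bool"
    if raw.endswith("i64") or raw.endswith("i"):
        return "bigint"
    # An f64 suffix, or a bare number with no suffix, is treated as double.
    return "double"


def create_stable_sql(measurement, tags, fields):
    """Build a create-stable statement for one parsed row."""
    columns = ["_ts timestamp"]
    columns += [f"{name} {infer_field_type(raw)}" for name, raw in fields.items()]
    tag_defs = [f"{name} nchar({len(value)})" for name, value in tags.items()]
    column_list = ", ".join(columns)
    tag_list = ", ".join(tag_defs)
    return f"create stable {measurement} ({column_list}) tags({tag_list})"
```

This sketch keeps columns in the order in which they appear in the input row, which may differ from the column order that TDengine itself produces.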

Processing Schema Changes

This section describes the impact on the schema caused by different data being written.

If you use line protocol to write to a specific field and later write to the same field with a different data type, a schema error occurs and the write API returns an error. This is shown as follows:

st,t1=3,t2=4,t3=t3 c1=3i64,c3="passit",c2=false,c4=4 1626006833639000000
st,t1=3,t2=4,t3=t3 c1=3i64,c3="passit",c2=false,c4=4i 1626006833640000000

The first row defines c4 as a double. However, in the second row, the suffix indicates that the value of c4 is a bigint. This causes schemaless writing to throw an error.

A different case arises when the data written to a binary column exceeds the defined length of the column:

st,t1=3,t2=4,t3=t3 c1=3i64,c5="pass" 1626006833639000000
st,t1=3,t2=4,t3=t3 c1=3i64,c5="passit" 1626006833640000000

The first row defines c5 as binary(4), but the second row writes 6 bytes to it. The length of the binary column is therefore expanded automatically to hold the data.

st,t1=3,t2=4,t3=t3 c1=3i64 1626006833639000000
st,t1=3,t2=4,t3=t3 c1=3i64,c6="passit" 1626006833640000000

In the preceding data, the second row includes a new field, c6, of type binary(6). When this occurs, a new column c6 with type binary(6) is added to the supertable automatically.
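
The three cases in this section (type conflict, widening of a binary column, and a new column) can be summarized in one small in-memory model. The names SchemaConflict and merge_column are hypothetical, and the schema is modeled as a plain dictionary. This sketch mirrors the behavior described in the text, not TDengine's implementation.

```python
class SchemaConflict(Exception):
    """Raised when a column is written with a different base type."""


def merge_column(schema, name, base, length=None):
    """Merge one observed column into the schema.

    schema maps column name to a (base_type, length) tuple, where length
    is None for fixed-size types such as bigint or double.
    Returns "added", "widened", or "unchanged".
    """
    if name not in schema:
        schema[name] = (base, length)  # new column is added automatically
        return "added"
    old_base, old_length = schema[name]
    if old_base != base:
        raise SchemaConflict(f"column {name} is {old_base}, got {base}")
    if length is not None and old_length is not None and length > old_length:
        schema[name] = (base, length)  # widths only grow, never shrink
        return "widened"
    return "unchanged"
```

With an empty schema, writing c4 as double and then as bigint raises SchemaConflict, while writing c5 as binary of length 4 and then of length 6 widens the column to 6.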

Write Integrity

TDengine guarantees the idempotency of schemaless writes. This means that you can safely call the API again to retry a write that returned an error. However, TDengine does not guarantee the atomicity of multi-row writes. In a multi-row write, some rows may be written successfully while others fail.

For more about schemaless writing, see the official documentation.