Why Redis, Kafka, and Spark Aren't Needed When TDengine Is Used in an IoT Platform

TDengine is an open-source, full-stack platform for time-series data processing. By exploiting the characteristics of time-series data, TDengine outperforms general-purpose databases in data ingestion rate and query speed, while using fewer computing and storage resources. The core of TDengine is a database, but a big data platform usually also includes message queuing, caching, and stream computing components. Why are Redis, Kafka, Spark, and similar tools not needed if TDengine is used? This blog offers some insight into that question.

Message Queue

Unlike Internet applications, IoT scenarios have predictable traffic: given the number of connected devices and the data sampling rate, the data traffic can be estimated roughly in advance. During the 11/11 shopping festival in China, traffic on Taobao.com and jd.com may jump tenfold. IoT traffic, by contrast, is quite stable, and it grows linearly as more devices are added to the platform. In addition, a connected device usually has its own buffer to hold data while the network is unreachable, so message queuing is not mandatory in an IoT platform.

TDengine provides a simple way to queue the data records it receives. Received records are written to a log file, so once the sender has received an acknowledgment, the record will not be lost. At the same time, TDengine provides a subscription interface that lets the application consume newly arrived data. The application can subscribe to the raw data, to aggregations over time, to aggregations over a set of devices, or to both kinds of aggregation. With this design, Kafka or similar message queuing tools are no longer needed in a typical IoT platform.
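The rough traffic estimate mentioned above is simple arithmetic. The sketch below is only an illustration: the function name, the fleet size, and the 100-byte record size are assumptions, not TDengine figures.

```python
def estimate_traffic(num_devices, sample_interval_s, record_bytes):
    """Return (records per second, bytes per second) for a steady device fleet."""
    records_per_s = num_devices / sample_interval_s
    return records_per_s, records_per_s * record_bytes


# Hypothetical fleet: 100,000 devices, one 100-byte record every 10 seconds.
rate, bandwidth = estimate_traffic(100_000, 10, 100)
print(rate, bandwidth)  # 10000.0 1000000.0

# Doubling the number of devices doubles the traffic (linear growth).
rate_doubled, _ = estimate_traffic(200_000, 10, 100)
print(rate_doubled)  # 20000.0
```

Because the load grows linearly with the number of devices, capacity can be planned ahead instead of being absorbed by a queue sized for sudden bursts.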

To publish a data record with TDengine, the application just needs to use the standard SQL INSERT command. For subscription, the C/C++ API is:

taos_subscribe(char *host, char *user, char *pass, char *db, char *table, long time, int mseconds)

Here host, user, pass, db, and table are the parameters needed to access a table in a database, mseconds controls how often the database is polled for new records, and time controls how far back from now old records are retrieved. For APIs in Java or other languages, please refer to the TDengine user manual.
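To make the roles of the polling interval and the look-back time concrete, here is a toy in-memory model of a polling subscription. It does not call the TDengine client library; the Table and Subscription classes and their parameters are hypothetical names invented for this illustration.

```python
class Table:
    """Toy append-only table of (timestamp_ms, value) records."""

    def __init__(self):
        self.rows = []

    def insert(self, ts_ms, value):
        self.rows.append((ts_ms, value))


class Subscription:
    """Polling consumer: each poll returns the rows newer than the last one seen."""

    def __init__(self, table, now_ms, lookback_ms):
        self.table = table
        # Start lookback_ms in the past so recent history is delivered first.
        self.last_ts = now_ms - lookback_ms

    def poll(self):
        new_rows = [row for row in self.table.rows if row[0] > self.last_ts]
        if new_rows:
            self.last_ts = max(ts for ts, _ in new_rows)
        return new_rows


table = Table()
table.insert(1_000, 20.5)
table.insert(2_000, 20.7)

# Subscribe at t = 2500 ms and look back 1000 ms: only the row at 2000 ms qualifies.
sub = Subscription(table, now_ms=2_500, lookback_ms=1_000)
first = sub.poll()

table.insert(3_000, 21.0)
second = sub.poll()   # the newly arrived row
third = sub.poll()    # nothing new since the previous poll
print(first, second, third)  # [(2000, 20.7)] [(3000, 21.0)] []
```

In a real application the poll would run on a timer, which is the role the mseconds parameter plays in the C/C++ call.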

Caching

TDengine allocates a fixed-size buffer in memory, and newly arrived data is written into this buffer first. Every device or table gets one or more memory blocks. For a typical Internet application such as Twitter, what gets cached depends on end-user behavior: content that many users access is cached, and content that nobody accesses is not cached at all. In typical IoT scenarios, however, the hot data is always the newly arrived data, because it matters most for timely analysis. Based on this observation, TDengine manages the cache blocks in a FIFO manner. If there is not enough space in the buffer, the oldest data is saved to disk first and then overwritten by newly arrived data. TDengine also guarantees that every device can keep at least one block of data in the buffer.
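The FIFO policy described above can be sketched in a few lines. This is a simplified model under stated assumptions, not TDengine's implementation: it works on single records rather than memory blocks, the class and method names are hypothetical, and a small dictionary stands in for the guarantee that each device keeps its most recent data in memory.

```python
from collections import deque


class FifoCache:
    """Toy fixed-size cache: when full, the oldest record is flushed to 'disk'."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()  # (device_id, ts, value), oldest on the left
        self.disk = []         # stands in for persistent storage
        self.latest = {}       # device_id -> most recent record

    def write(self, device_id, ts, value):
        if len(self.buffer) >= self.capacity:
            # Persist the oldest record before its slot is reused.
            self.disk.append(self.buffer.popleft())
        record = (device_id, ts, value)
        self.buffer.append(record)
        self.latest[device_id] = record

    def last(self, device_id):
        """Return the newest record of a device without touching the disk."""
        return self.latest.get(device_id)


cache = FifoCache(capacity=2)
cache.write("d1", 1, 20.0)
cache.write("d2", 2, 21.0)
cache.write("d1", 3, 20.5)  # buffer is full, so ("d1", 1, 20.0) goes to disk

print(cache.disk)           # [('d1', 1, 20.0)]
print(list(cache.buffer))   # [('d2', 2, 21.0), ('d1', 3, 20.5)]
print(cache.last("d1"))     # ('d1', 3, 20.5)
```

Eviction here depends only on arrival order, never on how often a record is read, which is the key difference from the access-driven caching used by Internet applications.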

With this design, the application can retrieve the latest data from each device very quickly, since it is all available in memory. TDengine provides a special SQL function, last_row, to return the last data record. If a super table is used, last_row can return the last data records of all devices or of a subset of them. For example, to retrieve the latest temperature from the thermometers located in Beijing, execute the following SQL:

select last_row(*) from thermometers where location='beijing'

With this caching feature, Redis or similar tools are not needed in an IoT platform.

Stream Computing

IoT data is a kind of data stream. It usually requires real-time computing to monitor status, raise alarms, or generate real-time reports. Based on a sliding-window algorithm, TDengine can execute queries in the background periodically. This continuous query feature provides a simple implementation of stream computing and makes real-time aggregation easier. For example, if every thermometer collects the temperature every 10 seconds, and the system needs to calculate the average temperature over the past 3 minutes once every minute, you only need to execute the following SQL:

select avg(degree) from thermometer interval(3m) sliding(1m)

Here interval is the size of the sliding window, and sliding controls how often the window moves forward. For most IoT scenarios, Spark or similar tools are not needed.
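To show what a 3-minute window that slides every minute computes, here is a small self-contained sketch. It only illustrates the idea: the function name is hypothetical, and the way window start times are aligned (multiples of the sliding step, beginning at the first sample) is an assumption that may differ from TDengine's behavior.

```python
def sliding_avg(samples, interval_s, sliding_s):
    """Average the samples in each window [start, start + interval_s).

    Window starts are spaced sliding_s apart, beginning at the aligned
    timestamp of the first sample (an assumption made for this sketch).
    """
    if not samples:
        return []
    first_ts = min(ts for ts, _ in samples)
    last_ts = max(ts for ts, _ in samples)
    results = []
    start = (first_ts // sliding_s) * sliding_s
    while start <= last_ts:
        values = [v for ts, v in samples if start <= ts < start + interval_s]
        if values:
            results.append((start, sum(values) / len(values)))
        start += sliding_s
    return results


# One reading every 10 seconds for 5 minutes: 20.0 degrees for the first
# 3 minutes, then 26.0 degrees.
samples = [(ts, 20.0 if ts < 180 else 26.0) for ts in range(0, 300, 10)]

averages = sliding_avg(samples, interval_s=180, sliding_s=60)
print(averages)
# [(0, 20.0), (60, 22.0), (120, 24.0), (180, 26.0), (240, 26.0)]
```

Because consecutive windows overlap by two minutes, the average moves gradually from 20.0 to 26.0 rather than jumping, which is the smoothing effect a sliding window is meant to provide.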

Full Stack for Time-Series Data Processing

As discussed above, TDengine provides the caching, message queuing, and stream computing required for a big data platform. But in typical IoT or connected-car scenarios, there is always a good deal of business data as well. This business data is not time-series data, and it often involves complex relationships, so TDengine cannot handle it. It should be processed by a traditional database such as MySQL. Fortunately, compared with time-series data, the volume of business data is very small, less than 1% of the total, so a big data platform is not required for it.

In IoT, connected-car, and Industrial Internet scenarios, time-series data accounts for at least 99% of all the data in a business. With TDengine as the data processing engine, Kafka, HDFS, HBase, Spark, Redis, and other big data tools are no longer needed. TDengine simplifies the system architecture, reduces the complexity of development and operations, and lowers the total cost of ownership. It also makes the system more robust.

For private deployments in IoT scenarios, TDengine has another big advantage: it takes only seconds to go from download to a running TDengine installation, and only seconds to add a new node to a TDengine cluster. Historical data and real-time data are handled in the same way, and data can be replicated to other nodes automatically in real time. Compared with a general big data platform, TDengine significantly reduces the complexity of deployment and maintenance.