There are many big data processing tools for IoT, and the most popular is Hadoop. Hadoop is an ecosystem that includes HDFS, HBase, Hive, YARN, Storm, Spark, Zookeeper and other tools. Besides Hadoop, a typical big data platform usually has other components, such as Kafka for message queuing and Redis for caching. Together, these tools work well for most Internet applications. But for the Internet of Things, Connected Cars and the Industrial Internet, the data volume is huge, and processing it efficiently is a technical challenge. Fortunately, IoT data has some common characteristics. A specialized big data platform that is designed around these characteristics can deliver significantly better performance.
TDengine is such a specialized big data platform, designed and optimized for IoT scenarios. Its core is a time-series database. To further reduce the complexity and cost of development and operation, TDengine also provides the caching, message queuing and stream computing that a big data platform requires. Compared with other solutions, TDengine has the following advantages.
1: 10X Faster on Insert/Query Speed
IoT data is structured time-series data, so TDengine adopts structured data storage instead of the popular Key-Value storage. In IoT scenarios, each connected device is a single data source, and the data it generates is a sequence of data points ordered by timestamp. Users also tend to care about the trend of the data collected in a time window rather than the value at a specific time. Based on these characteristics, TDengine requires the application to create a dedicated table for each device. For example, if there are 10 million smart meters, 10 million tables are created.
With the “one table per device” design, the data records generated by one device can be stored contiguously in blocks, in memory or on hard disk. Retrieving data from a single device in a time window is very fast, since one read operation can retrieve a batch of data records. Because network latency is hard to control, data records from different devices may arrive at the server out of order. Data records from the same device, however, arrive at the server in order no matter how large the network latency is. Inserting a new data record into a table therefore becomes an append operation, and the data ingestion rate improves significantly.
Being schemaless is the big advantage of a Key-Value store: every data record may have a different structure, and any record can be inserted without defining a schema first. But in IoT and connected car scenarios, the structure of the data generated by a device rarely changes unless the firmware is updated, so a KV store does not bring much benefit to IoT data processing. In addition, TDengine provides an innovative way for the application to change the schema with minimal impact on performance and cost.
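The append-only idea can be illustrated with a minimal in-memory sketch. It is not TDengine's storage engine; the class name `DeviceStore` and its methods are hypothetical, and the sketch assumes that records from one device arrive in strictly increasing timestamp order, as described above.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class DeviceStore:
    """Toy store (hypothetical): one append-only table per device."""

    def __init__(self):
        # device_id -> parallel lists of timestamps and values
        self.timestamps = defaultdict(list)
        self.values = defaultdict(list)

    def insert(self, device_id, ts, value):
        ts_list = self.timestamps[device_id]
        # Assumption: a device's records arrive in timestamp order,
        # so an insert is a plain append, never a sorted insertion.
        if ts_list and ts <= ts_list[-1]:
            raise ValueError("out-of-order record for device %r" % device_id)
        ts_list.append(ts)
        self.values[device_id].append(value)

    def query(self, device_id, start_ts, end_ts):
        """Return rows with start_ts <= ts <= end_ts as one contiguous slice."""
        ts_list = self.timestamps.get(device_id, [])
        vals = self.values.get(device_id, [])
        lo = bisect_left(ts_list, start_ts)
        hi = bisect_right(ts_list, end_ts)
        return list(zip(ts_list[lo:hi], vals[lo:hi]))


store = DeviceStore()
for ts, v in [(1, 10.0), (2, 11.0), (3, 12.0), (4, 13.0)]:
    store.insert("meter-1", ts, v)
window = store.query("meter-1", 2, 3)  # one slice of a single device's table
```

Because each table is already sorted by time, a time-window query reduces to two binary searches and one contiguous slice.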
2: Reduce the computing and storage resources significantly
Since insert and query performance is 10x higher, the system requires far fewer computing resources. It also requires much less storage space.
The value of a metric collected from a device fluctuates, but it changes gradually rather than sharply between two adjacent time points. TDengine adopts column-based storage, and the data points of the same metric from the same device are stored contiguously, so it is much easier to achieve a high compression ratio. In addition, TDengine applies different compression algorithms to different data types, such as delta-of-delta encoding, Simple8B, zig-zag encoding and LZ4, to improve the compression ratio further. Compared with a general-purpose database, TDengine uses less than 1/5 of the storage space.
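To show why slowly changing values compress well, here is a minimal sketch of delta-of-delta encoding combined with zig-zag encoding for integers. It illustrates the general techniques only and is not TDengine's codec; the function names are hypothetical, and the bit-packing (Simple8B) and LZ4 stages are left out.

```python
def zigzag_encode(n):
    """Map signed integers to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def dod_encode(values):
    """Encode integers as zig-zagged deltas of deltas."""
    out = []
    prev, prev_delta = 0, 0
    for v in values:
        delta = v - prev
        out.append(zigzag_encode(delta - prev_delta))
        prev, prev_delta = v, delta
    return out


def dod_decode(encoded):
    """Rebuild the original integers from dod_encode output."""
    out = []
    prev, prev_delta = 0, 0
    for z in encoded:
        delta = prev_delta + zigzag_decode(z)
        prev = prev + delta
        prev_delta = delta
        out.append(prev)
    return out


# Timestamps sampled roughly every 10 units: after the first two entries,
# the encoded values are tiny and would need only a few bits each.
timestamps = [1000, 1010, 1020, 1030, 1041]
encoded = dod_encode(timestamps)  # [2000, 1979, 0, 0, 2]
```

After the first two entries, a regularly sampled series encodes to values near zero, which a later bit-packing stage can store in very few bits.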
3: Reduce the complexity of system architecture
Unlike Internet applications, in IoT scenarios the data traffic can be estimated roughly from the number of connected devices and the data sampling rate. During the 11/11 shopping festival in China, the traffic on Taobao.com and jd.com may jump to 10 times the normal level. IoT traffic, in contrast, is quite stable, and it grows linearly as more devices are added to the platform. Also, a connected device usually has its own buffer to store data while the network is unreachable, so message queuing is not mandatory in an IoT platform. TDengine provides a simple way to queue the data records it receives and a subscription interface that lets the application consume newly arrived data. It is not necessary to use Kafka or other message queuing tools if TDengine is used.
TDengine allocates a fixed-size buffer in memory, and newly arrived data is written into the buffer first. The buffer is managed in a FIFO way. If there is not enough space in the buffer, the oldest data is saved to hard disk first and then overwritten. TDengine also guarantees that every device keeps at least one block of data in the buffer. With this design, the application can retrieve the latest data from each device very quickly, since it is always available in memory. It is not necessary to use Redis or other caching tools if TDengine is used.
Time-series data is a stream. Using a sliding-window algorithm, TDengine executes queries over a specified time window periodically. The queries can aggregate over the time window, over a set of devices, or both. These continuous queries can replace the stream computing tools, such as Spark, that most IoT scenarios would otherwise require.
TDengine provides a full stack for time-series data processing, including database, caching, message queuing and stream computing. If TDengine is used in an IoT platform, Kafka, HDFS, HBase, Spark, Redis and other tools are no longer necessary. The overall complexity of development and operation is reduced significantly, data consistency is easier to maintain, and the system is more robust.
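The traffic estimate mentioned at the start of this point is simple arithmetic. The sketch below is only an illustration: the function name, the 10-second sampling interval and the 20-byte record size are assumptions chosen for the example, not figures from TDengine.

```python
def estimate_ingest(num_devices, sample_interval_s, bytes_per_record):
    """Return (records per second, bytes per second) for a steady device fleet."""
    records_per_s = num_devices / sample_interval_s
    return records_per_s, records_per_s * bytes_per_record


# Assumed example: 10 million meters, one 20-byte record every 10 seconds.
records_per_s, bytes_per_s = estimate_ingest(10_000_000, 10, 20)
# Doubling the number of devices doubles both figures.
```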
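The buffer behavior described above can be sketched as follows. This is a simplified model, not TDengine's implementation: it buffers single records rather than blocks, the class name `WriteBuffer` and the `flush` callback are hypothetical, and a separate dictionary of the latest record per device stands in for the guarantee that each device keeps some data in memory.

```python
from collections import deque


class WriteBuffer:
    """Toy fixed-size FIFO write buffer (hypothetical sketch)."""

    def __init__(self, capacity, flush):
        self.capacity = capacity
        self.flush = flush      # callback that persists one evicted record
        self.fifo = deque()     # oldest record on the left
        self.latest = {}        # device_id -> most recent (ts, value)

    def write(self, device_id, ts, value):
        # When the buffer is full, persist the oldest record before reusing its slot.
        if len(self.fifo) >= self.capacity:
            self.flush(self.fifo.popleft())
        self.fifo.append((device_id, ts, value))
        self.latest[device_id] = (ts, value)

    def last(self, device_id):
        """Return the newest record for a device straight from memory."""
        return self.latest.get(device_id)


on_disk = []                    # stands in for the hard disk
buf = WriteBuffer(capacity=2, flush=on_disk.append)
buf.write("meter-1", 1, 10.0)
buf.write("meter-2", 1, 20.0)
buf.write("meter-1", 2, 11.0)   # buffer full: the oldest record is flushed first
```

Reading the latest value of a device never touches the disk, which is the property that lets the buffer double as a cache.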
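A sliding-window aggregation of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not TDengine's query engine: it computes the average for each window in one pass over a finished list, whereas a continuous query would run once per step as new data arrives.

```python
def sliding_window_avg(rows, window, step, start, end):
    """Average the values in each window [t, t + window), advancing t by step.

    rows is a list of (timestamp, value) pairs. Windows that contain
    no rows are skipped.
    """
    results = []
    window_start = start
    while window_start + window <= end:
        window_end = window_start + window
        values = [v for ts, v in rows if window_start <= ts < window_end]
        if values:
            results.append((window_start, sum(values) / len(values)))
        window_start += step
    return results


rows = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
averages = sliding_window_avg(rows, window=2, step=1, start=0, end=4)
# averages == [(0, 1.5), (1, 2.5), (2, 3.5)]
```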
4: Powerful data analysis
TDengine hides from applications whether data resides in the cache, on hard disk or in network storage. To retrieve or analyze data, the application only needs to specify the time range, whether it is 10 years ago or one second ago; it does not need to know where the data is located. This makes the application simpler.
The data from the same device is stored block by block. To make aggregation faster, the system pre-aggregates each block. With pre-aggregation, aggregation results can be obtained without scanning the raw data, which improves performance significantly. Even when raw data has to be scanned, it is still fast, since a single IO operation can retrieve a large batch of data records. Also, because the data is stored in a structured way, decompressed data is written into an array directly without any parsing. Compared with a NoSQL database, analysis performance is much higher.
To aggregate data over a set of devices, TDengine introduces a new concept, the Super Table. It allows the user to associate one or more static tags with each table or device. By specifying tag values, the user tells the system which devices to include in the aggregation. In addition, TDengine implements a special algorithm that scans each data file only once, which reduces IO operations when aggregating over multiple devices.
To make the data analyst's work simpler, TDengine allows the user to execute ad hoc queries via the TDengine shell, Python, R or MATLAB. It is an ideal data warehouse tool for IoT big data.
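The benefit of per-block pre-aggregation can be illustrated with a minimal sketch. The `Block` class and `range_sum_count` function are hypothetical and do not reflect TDengine's file format; the sketch assumes each block stores its rows sorted by timestamp along with a few precomputed summaries.

```python
class Block:
    """A batch of (timestamp, value) rows plus precomputed summaries."""

    def __init__(self, rows):
        self.rows = rows                    # sorted by timestamp, non-empty
        values = [v for _, v in rows]
        self.first_ts = rows[0][0]
        self.last_ts = rows[-1][0]
        self.count = len(values)
        self.total = sum(values)
        self.minimum = min(values)
        self.maximum = max(values)


def range_sum_count(blocks, start, end):
    """Return (sum, count, blocks_scanned) for rows with start <= ts <= end."""
    total, count, scanned = 0, 0, 0
    for block in blocks:
        if block.last_ts < start or block.first_ts > end:
            continue                        # block lies outside the range
        if start <= block.first_ts and block.last_ts <= end:
            # Fully covered: use the summary, no raw rows are read.
            total += block.total
            count += block.count
        else:
            # Partially covered: only here are the raw rows scanned.
            scanned += 1
            for ts, value in block.rows:
                if start <= ts <= end:
                    total += value
                    count += 1
    return total, count, scanned


blocks = [Block([(0, 1), (1, 2)]), Block([(2, 3), (3, 4)]), Block([(4, 5), (5, 6)])]
result = range_sum_count(blocks, 1, 5)  # (20, 5, 1): only the first block is scanned
```

Only blocks that straddle the edge of the query range need a raw scan; every block fully inside the range is answered from its summary.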
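Tag-based selection across per-device tables can be sketched as follows. The data layout, the tag names and the helper functions are all assumptions made for illustration; they do not represent TDengine's Super Table implementation or its SQL syntax.

```python
def select_tables(tables, **tag_filter):
    """Return the names of tables whose static tags match every given tag value."""
    return [
        name
        for name, table in tables.items()
        if all(table["tags"].get(k) == v for k, v in tag_filter.items())
    ]


def average_over(tables, **tag_filter):
    """Average all values from the tables selected by the tag filter."""
    values = [
        v
        for name in select_tables(tables, **tag_filter)
        for v in tables[name]["values"]
    ]
    return sum(values) / len(values) if values else None


# Hypothetical per-device tables, each carrying static tags.
meters = {
    "meter_1": {"tags": {"location": "Beijing", "type": 1}, "values": [1.0, 3.0]},
    "meter_2": {"tags": {"location": "Beijing", "type": 2}, "values": [5.0]},
    "meter_3": {"tags": {"location": "Shanghai", "type": 1}, "values": [100.0]},
}
beijing_avg = average_over(meters, location="Beijing")  # 3.0
```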
5: Zero management, no learning curve
The TDengine installation package is only 1.5 MB, and it takes only seconds to go from download to a successful installation. With the enterprise edition, it takes seconds to add a new node to a cluster. There is no manual partitioning of tables or databases, and data can be replicated to other nodes automatically in real time. Compared with a general-purpose database, management is much simpler. TDengine adopts standard SQL as its query language and offers C/C++, Java (JDBC), Python, Go and RESTful connectors. It works like MySQL, so there is no learning curve.
6: Connect with other tools seamlessly
To make data collection simpler, TDengine supports Telegraf and Kafka. MQTT and OPC will be supported soon. On the application side, the visualization tool Grafana is supported, as are MATLAB, R and some BI tools. Since TDengine supports JDBC, more third-party tools can be connected seamlessly.
For DevOps scenarios, as long as Telegraf, Grafana and TDengine are configured appropriately, an IT infrastructure or application monitoring platform can be set up quickly without a line of code.
7: Open source
TDengine does not depend on any other tools; TAOS Data developed it from the ground up. Since its first commercial release in August 2018, it has gained a dozen paying customers in areas such as the power grid, CNC machines, smart cities and connected cars. To promote TDengine globally and quickly, TAOS Data open-sourced its core storage engine and computing engine this July. The TDengine community edition can meet the requirements of small to medium-scale IoT, Industrial Internet and Connected Cars applications.
TDengine outperforms other time-series data solutions in functionality, performance and usability. Adopting TDengine makes it much easier to set up a big data platform for the Internet of Things, Connected Cars, the Industrial Internet and DevOps. It reduces the computing and storage resources needed, the complexity of development and operations, and the human resources required. The total cost of ownership is reduced significantly.
Since it is open source, just download it and give it a try.