Modern industrial data processing requires not just storage and analytics, but a full-featured industrial data solution. In a typical solution, a time-series database is combined with third-party components for stream processing, caching, data subscription, and other features as needed.
Because the overall solution has dependencies on these third-party components, complexity of design and difficulty of maintenance become pain points for time-series data processing. In addition, this kind of design requires additional compute and storage resources.
To simplify system design and reduce operating costs, TDengine integrates a time-series database (TSDB) with caching, stream processing, and data subscription components that take full advantage of the characteristics of time-series data. This integrated design goes beyond efficient storage and analysis of time-series data to provide a comprehensive and simplified solution for industrial data processing.
Stream Processing
To gain insight into your operations and detect errors faster, you need to analyze data points as soon as they arrive in the system. Stream processing is therefore a natural fit for industrial data. Stream processing can be time-driven, producing new results at set intervals (known as continuous query), or data-driven (also called event-driven), producing new results whenever a new data point arrives.
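The difference between the two modes can be illustrated with a minimal Python sketch. This is not TDengine code: the function names `time_driven_averages` and `data_driven_alerts`, the `(timestamp, value)` tuple layout, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def time_driven_averages(points, interval):
    """Group points into fixed-size time windows and emit one average per window."""
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [running total, count]
    for ts, value in points:
        window_start = ts - ts % interval
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]


def data_driven_alerts(points, threshold):
    """Evaluate each point as it arrives and emit a result immediately if it qualifies."""
    for ts, value in points:
        if value > threshold:
            yield (ts, value)


# Hypothetical readings as (timestamp in seconds, value) pairs.
readings = [(0, 10.0), (4, 14.0), (10, 30.0), (13, 50.0)]
window_results = time_driven_averages(readings, interval=10)
alert_results = list(data_driven_alerts(readings, threshold=25.0))
```

The time-driven function produces one result per window regardless of how many points arrive, which suits downsampling. The data-driven generator reacts to every individual point, which is the behavior that low-latency cases such as fault detection depend on.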
Many time-series databases provide a solution for continuous query. Continuous query works well in certain cases, for example in downsampling or in precalculating specific types of expensive queries. However, there are many other situations in which continuous query is intrinsically limited: preprocessing and transformation with scalar functions, session windows, and low-latency use cases like fault detection are all examples of scenarios where continuous query is not up to the task. In these scenarios, tools like Spark and Flink are added to the time-series data solution to provide stream processing. Unfortunately, this results in a complicated system design.
In the TDengine stream processing engine, SQL statements, including user-defined functions, are translated into pipelines of stream operators. Data is then automatically processed on ingestion and results are output in real time, at specified time periods, or at a specified watermark. This engine makes TDengine the only industrial data solution that supports both time-driven and event-driven stream processing out of the box.
TDengine stream processing provides millisecond-level latency while processing high-throughput events, enabling real-time alerting, data transformation, and preprocessing. With TDengine, you can perform aggregations and gain insight into your data fast enough to support real-time dashboards and real-time data analysis. In addition, TDengine can handle out-of-order data by means of user-specified watermarks or automated retrieval of data from the storage engine on demand.
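To make the watermark idea concrete, here is a minimal Python sketch of the general technique, not TDengine's implementation. It assumes a window is closed only once the newest timestamp seen so far is at least `watermark` past the window's end. The function name `windows_with_watermark` and the sample data are hypothetical. The sketch simply drops rows that arrive after their window has closed, whereas the text above notes that TDengine can also retrieve data from the storage engine.

```python
def windows_with_watermark(points, interval, watermark):
    """Average values per window, tolerating rows that arrive up to `watermark` late."""
    open_windows = {}  # window_start -> values collected so far
    closed = []        # (window_start, average) for finalized windows
    max_ts = None      # newest timestamp seen so far

    for ts, value in points:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        start = ts - ts % interval
        if start + interval <= max_ts - watermark:
            continue  # window already finalized: too late, dropped in this sketch
        open_windows.setdefault(start, []).append(value)
        # Finalize every window whose end is older than the watermark allows.
        for s in sorted(open_windows):
            if s + interval <= max_ts - watermark:
                values = open_windows.pop(s)
                closed.append((s, sum(values) / len(values)))
    return closed, open_windows


# The row at ts=8 arrives after ts=12 but within the watermark, so it is still counted.
# The row at ts=3 arrives after its window was finalized, so it is dropped.
arrivals = [(1, 10.0), (12, 20.0), (8, 30.0), (16, 40.0), (3, 99.0)]
closed, still_open = windows_with_watermark(arrivals, interval=10, watermark=5)
```

A larger watermark tolerates more disorder at the cost of delaying results, so the value is a trade-off between completeness and latency.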
Stream processing in TDengine is intuitive and easy to use, requiring only a few SQL statements.
Caching
Industrial IoT scenarios demand that the system return the latest data to the application as soon as possible. For example, a fleet management system always needs to know the current GPS position of each truck in the fleet. And in a smart factory, the system always needs to know the current state of every valve and the current reading of every meter.
A typical industrial data solution solves this problem by writing new data points into the database and into a caching product, such as Redis, at the same time. The application then retrieves the newest data points from the cache instead of the database. While this design works, it increases system complexity and operating costs.
In TDengine, each vnode is allocated a fixed amount of memory for caching data points. TDengine manages its cache with a first in, first out (FIFO) policy instead of the least recently used (LRU) policy adopted in some systems. This is because it is most important for industrial data applications to have fast access to the newest data in the system.
By caching the newest data, TDengine can retrieve it in milliseconds, eliminating the need to deploy a separate caching system. TDengine also provides the LAST_ROW function, an extension to standard SQL that retrieves the newest data points. With TDengine, you have access to a simple and straightforward built-in caching system to ensure that your applications get the data they need when they need it.
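The FIFO policy described above can be sketched in a few lines of Python. This is an illustrative model only, not how TDengine implements its vnode cache: the `FifoCache` class, its `capacity` parameter, and the `last_row` method (named after the SQL function for readability) are hypothetical.

```python
from collections import deque


class FifoCache:
    """Fixed-size cache that always evicts the oldest row, regardless of read activity."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry automatically when full.
        self.rows = deque(maxlen=capacity)

    def write(self, ts, value):
        self.rows.append((ts, value))

    def last_row(self):
        """Return the newest cached row, or None if nothing has been written."""
        return self.rows[-1] if self.rows else None


cache = FifoCache(capacity=3)
for ts in range(1, 6):
    cache.write(ts, ts * 1.5)  # hypothetical meter readings

newest = cache.last_row()
cached_rows = list(cache.rows)
```

Unlike an LRU cache, reads never change the eviction order here. For append-mostly time-series data, the rows kept in memory are therefore always the most recently written ones, which is what a "latest value" query needs.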
Data Subscription
The message queue plays an important role in many system architectures. Incoming data points are first written into a message queue and then consumed by other components in the system.
In TDengine, all incoming data points are first written into the write-ahead log (WAL) in append mode. Because the purpose of a WAL is to recover data in the event of a system crash, the WAL file is normally removed once the corresponding data in memory is persisted to the database.
TDengine takes a novel approach in that it does not delete the WAL file immediately, instead keeping it for a specified period during which the WAL file acts as a persistent message queue that can be consumed by other applications.
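The idea of a retained log doubling as a message queue can be modeled as follows. This is a conceptual Python sketch under stated assumptions, not TDengine's WAL format or API: the `RetainedLog` class, its `retention` parameter, and the offset-based `read_from` method are hypothetical.

```python
class RetainedLog:
    """Append-only log that keeps entries for a retention period so consumers can replay them."""

    def __init__(self, retention):
        self.retention = retention  # how long entries are kept, in the same unit as timestamps
        self.entries = []           # (offset, timestamp, payload), in append order
        self.next_offset = 0

    def append(self, ts, payload):
        self.entries.append((self.next_offset, ts, payload))
        self.next_offset += 1
        # Drop only entries older than the retention period, not entries already persisted.
        self.entries = [e for e in self.entries if e[1] > ts - self.retention]

    def read_from(self, offset):
        """Return all retained entries at or after the consumer's offset."""
        return [e for e in self.entries if e[0] >= offset]


log = RetainedLog(retention=10)
log.append(0, "a")
log.append(5, "b")
log.append(12, "c")  # entry "a" is now older than the retention period and is dropped

replayed = log.read_from(0)
```

Because each consumer tracks its own offset, several applications can read the same retained entries independently, which is the property that lets a log act as a queue.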
A topic in TDengine is defined by an SQL statement and can be a database, a supertable, a set of tables, or a single table. Once a topic has new data points, they are pushed to the consumers that have subscribed to it. If a database has multiple vnodes (shards), multiple consumers in a consumer group can consume the same topic in parallel to boost data throughput.
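How a consumer group might share the vnodes of a topic can be sketched as below. The source does not describe the actual assignment strategy, so the round-robin scheme here is an assumption, and the function name `assign_vnodes` and the identifiers are hypothetical.

```python
def assign_vnodes(vnode_ids, consumer_ids):
    """Spread vnodes across the consumers of one group in round-robin order."""
    assignment = {consumer: [] for consumer in consumer_ids}
    for index, vnode in enumerate(sorted(vnode_ids)):
        consumer = consumer_ids[index % len(consumer_ids)]
        assignment[consumer].append(vnode)
    return assignment


# Four vnodes shared by two consumers: each consumer reads two shards in parallel.
plan = assign_vnodes([2, 3, 4, 5], ["consumer-1", "consumer-2"])
```

Under this model, throughput scales with the number of consumers only up to the number of vnodes. Any extra consumers would receive an empty assignment.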
The main advantage of TDengine data subscription is that you can perform data filtering. Applications can subscribe to only those data points that meet specific filtering conditions, and other data points are not passed to the application at all. In addition, the application can subscribe to a specific column or set of columns instead of all the columns. For industrial data applications, this makes TDengine more flexible and efficient than third-party components.
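The effect of filtering and column selection at the source can be sketched as follows. This is a conceptual Python model, not the TDengine subscription API: the `subscribe` function, the row layout, and the column names are hypothetical.

```python
def subscribe(rows, predicate, columns):
    """Yield only the rows that match the filter, reduced to the requested columns."""
    for row in rows:
        if predicate(row):
            yield {column: row[column] for column in columns}


# Hypothetical meter rows written to the database.
incoming = [
    {"ts": 1, "device": "m1", "voltage": 218.0, "current": 10.2},
    {"ts": 2, "device": "m2", "voltage": 251.0, "current": 9.8},
    {"ts": 3, "device": "m1", "voltage": 255.5, "current": 11.0},
]

# The application asks only for over-voltage rows and only for three columns.
delivered = list(subscribe(incoming, lambda r: r["voltage"] > 250.0, ["ts", "device", "voltage"]))
```

Because non-matching rows and unneeded columns are discarded before delivery, the consumer receives less data and does not have to repeat the filtering itself.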
Simplified Industrial Data Solution
The following figure shows a classical design for a time-series data solution.
In this architecture, incoming data points are first written into a message queue such as Kafka. The data points in the message queue are consumed and written into a database like HBase or MongoDB for persistent storage. In the meantime, the data points are typically written into Redis for caching and Spark or Flink for stream processing. Time-series applications therefore are required to interact with Redis and Spark in addition to the database.
With its built-in caching, stream processing, and data subscription features, TDengine provides a simplified solution for industrial data processing. TDengine frees you from having to maintain Kafka, Redis, Spark, Flink, and other similar tools and allows you to develop applications without having to take into account the particularities of multiple systems.
This simplified system architecture reduces design complexity and operational costs significantly, which is particularly advantageous on the edge, where resources are most limited. With its integrated components purpose-built for industrial scenarios, TDengine is not just a data historian but a full-fledged industrial data solution.