Why Big Data Platforms Fail in IoT Scenarios

Jeff Tao

June 6, 2021 / Engineering

With the growth of Internet applications over the past decade, a number of big data platforms have emerged to process the large-scale data that they generate. Hadoop is one of the more popular choices for these scenarios, comprising core components such as HDFS and MapReduce as well as an ecosystem of tools — HBase, Hive, and more. Big data platforms like Hadoop are used with much success in scenarios like fraud detection, cyber security, public opinion analysis, and customer profiling.

Considering their effectiveness in Internet applications, it may seem natural to deploy these big data platforms to process IoT data for connected cars or industrial scenarios like smart manufacturing. And to be fair, it is possible to process IIoT data with a system like Hadoop. But this arrangement leaves much to be desired in terms of efficiency and costs.

Inefficient Development

Big data platforms require developers to integrate a set of independent tools, each of which has its own interface, development language, and configuration. This makes the system more difficult for developers to learn and often requires larger teams including experts in the various tools in the ecosystem. Moreover, when data flows from one component to another, maintaining data consistency is not an easy task. While most tools are open source and there are public forums to discuss any issues found, developers may be blocked by an unforeseen issue with no definite time for resolution.

In general, to build a reliable and efficient big data platform, a top-notch engineering team is required to develop and maintain it, which is often impossible for industrial and other non-tech enterprises.

Inefficient Operations

Most of the data generated by Internet applications is unstructured, and most big data platforms are designed to process that kind of data. Data from IoT devices, on the other hand, is typically structured time-series data. Put simply, processing structured data with tools built for unstructured data wastes compute and storage resources.

For example, consider a smart meter collecting two metrics, voltage and current. In HBase or a similar NoSQL database, the row key usually consists of the smart meter ID and static tags. Each metric value is stored as its own key-value record, whose key consists of the row key, column family, column qualifier, timestamp, and key type. Because the full key is repeated for every value collected, this method of storing records incurs a large overhead in terms of storage space.
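To make the overhead concrete, here is a rough back-of-the-envelope sketch in Python. It assumes an uncompressed HBase-style cell layout (length prefixes, row key, column family, qualifier, 8-byte timestamp, 1-byte key type, then the value) and ignores block encoding and compression, which reduce the on-disk size in practice. The meter ID, tag, family name, and 4-byte value size are hypothetical values chosen for illustration.

```python
from typing import List


def kv_cell_bytes(row_key: bytes, family: bytes, qualifier: bytes, value_len: int) -> int:
    """Approximate uncompressed size of one HBase-style key-value cell."""
    key_len = (
        2 + len(row_key)    # row length prefix + row key
        + 1 + len(family)   # family length prefix + column family
        + len(qualifier)    # column qualifier
        + 8                 # timestamp
        + 1                 # key type
    )
    return 4 + 4 + key_len + value_len  # key-length and value-length fields


def fixed_schema_row_bytes(value_lens: List[int]) -> int:
    """Size of one row when the schema is fixed: one timestamp plus raw values."""
    return 8 + sum(value_lens)


# Hypothetical row key: meter ID plus one static tag.
row_key = b"meter-0001|siteA"

# One reading = two cells (voltage and current), each repeating the full key.
kv_total = (
    kv_cell_bytes(row_key, b"m", b"voltage", 4)
    + kv_cell_bytes(row_key, b"m", b"current", 4)
)
fixed_total = fixed_schema_row_bytes([4, 4])

print(f"key-value: {kv_total} bytes, fixed schema: {fixed_total} bytes")
```

Under these assumptions the two metric values occupy only 8 bytes of each reading; the rest is key material that is repeated for every cell.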

Furthermore, the advantages of key-value storage for quickly iterating Internet applications do not apply to IoT devices, whose data structure changes rarely (upon firmware upgrade), if ever. Before the database can perform simple tasks like calculating an average value, key-value records need to be parsed and saved into an array, requiring additional compute resources for no benefit.
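The extra parsing step can be sketched as follows. This is a simplified illustration, not how any particular database is implemented: the cell tuples, table layout, and function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical key-value cells: ((row_key, qualifier, timestamp), value)
cells = [
    (("meter-0001", "voltage", 1000), 220.0),
    (("meter-0001", "current", 1000), 10.0),
    (("meter-0001", "voltage", 2000), 222.0),
    (("meter-0001", "current", 2000), 12.0),
]


def average_from_cells(cells, qualifier):
    """Scan every cell, inspect its key, and regroup values into arrays first."""
    columns = defaultdict(list)
    for (_row_key, qual, _ts), value in cells:
        columns[qual].append(value)
    return mean(columns[qualifier])


# With a fixed schema, each metric already sits in its own array.
table = {"ts": [1000, 2000], "voltage": [220.0, 222.0], "current": [10.0, 12.0]}


def average_from_column(table, column):
    """Aggregate directly over the stored column, with no key parsing."""
    return mean(table[column])


kv_avg = average_from_cells(cells, "voltage")
col_avg = average_from_column(table, "voltage")
print(kv_avg, col_avg)
```

Both functions return the same average, but the key-value version has to touch and decode every cell, including the current readings it does not need.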

High O&M Costs

Each tool in a big data platform — HBase and HDFS as well as auxiliary components like Kafka and Redis — has its own backend system that needs to be managed and monitored. Compared with a traditional information management system, in which a DBA is responsible only for managing a MySQL or Oracle deployment, the requirements in terms of management, configuration, and optimization placed on system administrators are massive.

In addition, because of the many tools involved, locating the root cause of an issue often becomes complicated and challenging. For example, imagine that a user reports that a data record is missing. It can take a long time to determine the location of the fault, considering that it could be caused by Kafka, HBase, Spark, or the application itself. System administrators are forced to somehow bring all the logs associated with the data record together, though these logs are generated by a number of different components. In the end, the more independent tools in the platform, the less stable the system becomes as a whole.

Owing to the low efficiency of development and maintenance mentioned previously, it takes a longer time to roll out a new product or service, which can affect profit margins. At the same time, the open-source components of the ecosystem are being updated frequently, requiring developer resources to keep them synchronized. If an enterprise is not big enough to support a high-end engineering team, the cost to use a “free” big data platform can be higher than simply using commercial products or services.

Time Series Database: The Solution for IoT

We can see that big data platforms, while a good fit for Internet applications, are not suited to IoT and industrial scenarios. For optimal performance and efficiency, these scenarios require a new data processing system — a time-series database — to be developed based on the characteristics of time-series data. This purpose-built time-series database requires the following features:

Significantly improved data ingestion and query rate to reduce compute and storage costs
Linearly scalable architecture to support the exponential growth of connected devices
Unified backend management module to simplify maintenance
Standard SQL as the query language, with client libraries for C/C++, Python, and Java (JDBC)
Open ecosystem for integration with third-party tools such as machine learning and visualization software

The core component of TDengine is a database purpose-built for time-series data that meets these five requirements. Its unique “one table per device” architecture and the innovative concept of supertables deliver higher performance in data ingestion, querying, and compression rates compared with general-purpose data systems as well as competing time-series databases. Its distributed cloud-native design enables high scalability, and its integration of essential components such as stream processing, data subscription, and caching forms a simplified solution for time-series data processing. And most importantly, TDengine makes it easy to get started with a free trial of TDengine Cloud and free downloads of TDengine OSS.
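The idea behind "one table per device" and supertables can be illustrated with a small conceptual model in Python. This is only a sketch under stated assumptions: it is not TDengine's API or storage engine, and the class names, method names, device IDs, and tag values are all hypothetical. It shows a shared schema for one device type, a separate table per device with its static tags stored once, and an aggregate query across devices filtered by tag.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class DeviceTable:
    """One table per device: tags stored once, metrics stored as columns."""
    tags: Dict[str, str]
    ts: List[int] = field(default_factory=list)
    columns: Dict[str, List[float]] = field(default_factory=dict)

    def insert(self, timestamp: int, **metrics: float) -> None:
        self.ts.append(timestamp)
        for name, value in metrics.items():
            self.columns.setdefault(name, []).append(value)


@dataclass
class SuperTable:
    """Shared schema for one device type; groups the per-device tables."""
    metric_names: List[str]
    tables: Dict[str, DeviceTable] = field(default_factory=dict)

    def create_table(self, device_id: str, **tags: str) -> DeviceTable:
        table = DeviceTable(tags=tags, columns={m: [] for m in self.metric_names})
        self.tables[device_id] = table
        return table

    def avg(self, metric: str, **tag_filter: str) -> float:
        """Average one metric over all device tables whose tags match."""
        values = [
            v
            for t in self.tables.values()
            if all(t.tags.get(k) == want for k, want in tag_filter.items())
            for v in t.columns[metric]
        ]
        return mean(values)


meters = SuperTable(metric_names=["voltage", "current"])
meters.create_table("d1001", location="site-a").insert(1000, voltage=220.0, current=10.0)
meters.create_table("d1002", location="site-a").insert(1000, voltage=224.0, current=11.0)
meters.create_table("d1003", location="site-b").insert(1000, voltage=230.0, current=9.0)

site_a_avg = meters.avg("voltage", location="site-a")
print(site_a_avg)
```

In this model each device's readings are appended to its own columns in time order, and the tags act as the filter when a query spans many devices.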

Jeff Tao
With over three decades of hands-on experience in software development, Jeff has had the privilege of spearheading numerous ventures and initiatives in the tech realm. His passion for open source, technology, and innovation has been the driving force behind his journey.

As one of the core developers of TDengine, he is deeply committed to pushing the boundaries of time series data platforms. His mission is crystal clear: to architect a high performance, scalable solution in this space and make it accessible, valuable and affordable for everyone, from individual developers and startups to industry giants.