Why general big data platforms fail in IoT scenarios

To process big data from Internet applications, Hadoop and many related tools have emerged on the market. Besides HDFS, MapReduce, HBase, Hive and other Hadoop components, a typical big data platform usually includes Kafka or another messaging tool, Redis or another caching tool, and Flink or another stream computing tool. For storage, there are many alternatives, such as MongoDB, Cassandra, and other NoSQL databases. Such platforms are widely used in fraud detection, cyber security, public opinion analysis, customer profiling and more.

It’s natural to use typical big data platforms to process data from the Internet of Things, Connected Vehicles, or the Industrial Internet. Most IoT platforms on the market adopt the Hadoop ecosystem as their data processing core. It certainly works, but is it good enough? Looking deeper, we can find many drawbacks and much room for improvement.

  • Inefficient development: the platform is not a single tool, so developers must integrate a set of independent tools. Each tool has its own interface, development language, and configuration, all of which take time to learn. When data flows from one component to another, keeping it consistent is not easy. Most of the tools are open source, and although there are public forums for discussing issues, developers can be blocked by an unknown issue for an indefinite period of time. In general, building a reliable and efficient big data platform requires a very strong engineering team to develop and maintain it.
  • Inefficient operation: the popular big data tools were designed to process unstructured data from Internet applications, but the data from connected devices is structured time-series data. Processing structured data with tools built for unstructured data requires more computing and storage resources. For example, suppose a smart power meter collects two metrics, voltage and current. If HBase or another NoSQL database is used, the row key usually consists of the smart meter ID and static tags. Each record is stored as a key, which consists of the row key, column family, column qualifier, timestamp, and key type, plus the actual metric value collected. Storing a record this way carries a large overhead and requires much more storage space. In addition, the data records must be parsed before they can be analyzed. For example, to calculate the average voltage, we need to parse the data records saved in key-value (KV) format, save the values into an array, and then compute the average, which requires more computing resources. The big advantage of KV storage is that it is schema-less: the application can insert any data record at will without defining a schema first. For Internet applications this is an attractive feature, since the application changes often. But for typical IoT applications, the schema rarely changes unless the firmware is updated.
  • High maintenance cost: every tool, such as Kafka, HBase, HDFS or Redis, has its own backend system that must be managed. In a traditional information system, the DBA only needs to learn how to manage MySQL or Oracle, but now the DBA has to learn to manage, configure, and optimize each tool. In addition, because so many tools are involved, locating the root cause of an issue becomes complicated and challenging. For example, if a user finds that a data record is missing, where was it lost? The cause could be Kafka, HBase, Spark, or the application itself. Finding out usually takes a long time, because all the logs associated with that data record must be collected from each tool and put together. The more tools are integrated, the less stable the whole system becomes.
  • Long time to market: because development and maintenance efficiency are low, it takes longer to roll out a new product or service, and business profit is squeezed. Also, open-source software is updated frequently, and keeping all the tools synchronized requires developer resources. If your business is not big enough to support a large engineering team, the cost of using this free big data software is even higher than that of professional services or products.
  • High cost of private deployment for small-scale data: because of data privacy and security concerns, many industry customers require private deployment instead of a cloud service. For small data volumes, the general big data platform is too heavy and the return on investment is low, so platform providers usually maintain two sets of solutions: one built on a typical big data platform, the other using only MySQL or another database to handle everything. Two sets of solutions mean higher development and maintenance costs.
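
The storage overhead described in the smart meter example above can be made concrete with a small sketch. The cell layout below is a simplified, self-describing key-value encoding loosely modeled on the fields the text lists (row key, column family, column qualifier, timestamp, key type); it is not the exact on-disk format of HBase or any other product. The meter ID, the column family name, the key type marker, and the 4-byte float values are all assumptions chosen for illustration.

```python
import struct

# Hypothetical values, chosen only for illustration.
METER_ID = b"meter-0001"
COLUMN_FAMILY = b"m"
KEY_TYPE_PUT = 4                 # arbitrary one-byte stand-in for the key type
ROW_FORMAT = ">qff"              # timestamp (8 bytes) + voltage + current (4 each)


def encode_kv_cell(row_key: bytes, family: bytes, qualifier: bytes,
                   ts_ms: int, value: float) -> bytes:
    """Encode ONE metric as a self-describing key-value cell (simplified)."""
    key = (
        struct.pack(">H", len(row_key)) + row_key    # row key with length prefix
        + struct.pack(">B", len(family)) + family    # column family with length prefix
        + qualifier                                  # column qualifier, e.g. b"voltage"
        + struct.pack(">qB", ts_ms, KEY_TYPE_PUT)    # timestamp + key type
    )
    # Key length and value length, then the key, then the 4-byte metric value.
    return struct.pack(">II", len(key), 4) + key + struct.pack(">f", value)


def encode_kv_reading(ts_ms: int, voltage: float, current: float) -> bytes:
    """One reading needs one cell per metric in the key-value layout."""
    return (
        encode_kv_cell(METER_ID, COLUMN_FAMILY, b"voltage", ts_ms, voltage)
        + encode_kv_cell(METER_ID, COLUMN_FAMILY, b"current", ts_ms, current)
    )


def encode_row(ts_ms: int, voltage: float, current: float) -> bytes:
    """With a fixed schema, the same reading is one fixed-width row."""
    return struct.pack(ROW_FORMAT, ts_ms, voltage, current)


def average_voltage(rows: bytes) -> float:
    """Average voltage over concatenated fixed-width rows, no key parsing needed."""
    voltages = [v for _, v, _ in struct.iter_unpack(ROW_FORMAT, rows)]
    return sum(voltages) / len(voltages)


if __name__ == "__main__":
    kv = encode_kv_reading(1_600_000_000_000, 220.0, 5.0)
    row = encode_row(1_600_000_000_000, 220.0, 5.0)
    print(len(kv), "bytes as key-value cells vs", len(row), "bytes as a fixed row")
```

In this sketch the key is repeated in full for every metric, so most of each cell is metadata rather than measurement. A fixed schema stores the meter identity once per table and only the timestamp and values per row, which is also why the average can be computed by striding over the buffer instead of parsing keys.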

The general big data platform has the issues above. Is there a better way to handle IoT big data? To answer that, we need to delve deeper into IoT scenarios. On closer study, we find that IoT data has its own characteristics. 1: it is time-series data; 2: it is structured data; 3: data is rarely updated; 4: there is only a single data source for each device; 5: the ratio of reads to writes is lower than in typical Internet applications; 6: the trend, not the value at a specific time, is what matters most; 7: there is always a retention policy; 8: queries are always executed over a time range and a set of devices; 9: real-time analytics is required; 10: traffic is stable; 11: special computations, such as interpolation, are required; 12: the amount of data is huge.
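
Several of these characteristics translate directly into simple data structures. The sketch below is a toy in-memory model, not the design of any particular product: it assumes one append-only, time-ordered series per device (characteristics 3 and 4), answers queries over a time range and a set of devices (8), drops data older than a cutoff (7), and fills gaps by linear interpolation (11). The class and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class DeviceSeries:
    """Append-only, time-ordered readings from a single device."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        # One data source per device, so readings arrive in time order
        # and existing points are never updated.
        if self.timestamps and ts <= self.timestamps[-1]:
            raise ValueError("readings must arrive in increasing time order")
        self.timestamps.append(ts)
        self.values.append(value)

    def range(self, start, end):
        """Return (ts, value) pairs with start <= ts <= end via binary search."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def expire_before(self, cutoff):
        """Retention policy: drop every reading older than cutoff."""
        lo = bisect_left(self.timestamps, cutoff)
        del self.timestamps[:lo]
        del self.values[:lo]

    def interpolate(self, ts):
        """Linearly interpolate the value at ts; None outside the stored range."""
        i = bisect_left(self.timestamps, ts)
        if i < len(self.timestamps) and self.timestamps[i] == ts:
            return self.values[i]
        if i == 0 or i == len(self.timestamps):
            return None
        t0, t1 = self.timestamps[i - 1], self.timestamps[i]
        v0, v1 = self.values[i - 1], self.values[i]
        return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)


def average_over(store, device_ids, start, end):
    """Average all readings from the given devices within [start, end]."""
    values = [v for d in device_ids for _, v in store[d].range(start, end)]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    store = defaultdict(DeviceSeries)
    for ts, volts in [(0, 220.0), (10, 222.0), (20, 224.0)]:
        store["meter-1"].append(ts, volts)
    print(average_over(store, ["meter-1"], 0, 10))
    print(store["meter-1"].interpolate(5))
```

Because each series is sorted and append-only, range queries and retention both reduce to a binary search plus a slice, with no locking or update handling required.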

If we take advantage of these characteristics, it is possible to design a specialized, optimized big data platform just for connected devices. Such a platform should have the following features. 1: data ingestion rate and query speed should be significantly higher, reducing the cost of computing and storage resources; 2: it should scale linearly to support the exponential growth of connected devices; 3: there should be a single backend management module to simplify maintenance; 4: it should provide standard SQL as the interface, with C/C++, Python, JDBC, and many other connectors; 5: it should be open to third-party tools, such as machine learning and visualization tools.
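
To illustrate what a standard SQL interface over structured time-series data looks like from the application side, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in so the example runs anywhere; it is not the platform described here, and the table name, column names, and sample readings are assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a fixed schema and a few sample readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        " ts INTEGER NOT NULL,"          # timestamp in milliseconds
        " meter_id TEXT NOT NULL,"       # device identity
        " voltage REAL NOT NULL,"
        " current_amps REAL NOT NULL,"
        " PRIMARY KEY (meter_id, ts))"
    )
    rows = [
        (1000, "meter-0001", 220.0, 5.0),
        (2000, "meter-0001", 222.0, 5.5),
        (3000, "meter-0001", 230.0, 6.0),
        (1000, "meter-0002", 218.0, 4.0),
        (2000, "meter-0002", 220.0, 4.5),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    return conn


def average_voltage(conn, meter_ids, start_ts, end_ts):
    """Average voltage per meter over a time range and a set of devices."""
    placeholders = ",".join("?" for _ in meter_ids)
    sql = (
        "SELECT meter_id, AVG(voltage) FROM readings "
        f"WHERE meter_id IN ({placeholders}) AND ts BETWEEN ? AND ? "
        "GROUP BY meter_id ORDER BY meter_id"
    )
    return conn.execute(sql, (*meter_ids, start_ts, end_ts)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for meter_id, avg in average_voltage(conn, ["meter-0001", "meter-0002"], 1000, 2000):
        print(meter_id, avg)
```

The point of the sketch is the shape of the query: a time range plus a set of devices, expressed in plain SQL that any developer or DBA already knows, with no per-tool query language to learn.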

By fully utilizing these IoT data characteristics, TAOS Data, a start-up based in Beijing, is rolling out an open-source, full-stack platform for processing time-series data. The product, TDengine, outperforms other solutions significantly. It shows strong promise of becoming a popular tool for IoT big data processing.