With the sharp drop in communication costs and the boom in sensors and smart devices, connected devices of every kind, from wristbands, bicycles, cars, and meters to elevators, CNC machines, oil wells, and excavators, are continuously generating huge amounts of data. The collected data is used to monitor the running status of devices, generate operation reports, raise alarms, and make predictions.
The data collected by sensors and devices is massive in scale. Fortunately, it has special characteristics that set it apart from the data of typical Internet applications, and a system that exploits these characteristics can process it far more efficiently. This paper summarizes these special characteristics.
- Time-series data: triggered by a predefined timer or by an event, connected devices generate a sequence of discrete data records, one after another. The timestamp attached to each record is the key for computation and analysis, and records are indexed by timestamp.
- Structured data: data from web crawlers, Twitter, social networks, and many other Internet applications is unstructured; it may be text, pictures, or videos. Data generated by connected devices, in contrast, is structured: each field has a predefined data type or a fixed length. For example, a 4-byte float is enough to record the voltage or current in a smart meter.
- No updates on data: collected data from connected devices is rarely modified or deleted; it is essentially a log. For typical Internet applications, in contrast, data records change all the time, so the underlying data engine must support update and delete operations.
- Single data source: the data records of a specific device can only be generated by that device, so each device has exactly one data source. In addition, the records generated by one device are independent of the records from other devices.
- The read/write ratio is lower: in a typical Internet application such as Twitter, a data record is written by a single user but read by many. For connected devices, raw data is mostly read by analysis software or other tools; people rarely read it unless a deep investigation is needed. The read/write ratio is therefore much lower than for Internet applications.
- The trend is more important: for bank records, tweets, and social network updates, every record matters. The metrics monitored on a device fluctuate, but the change between two successive time points is usually small. People therefore pay more attention to the trend over a time range, for example the past hour, than to the value at a specific time point.
- Retention policy: for cost reasons, collected data won’t be stored forever; there is always a retention policy. Data is removed once its lifetime has passed.
- Aggregation over time or a set of devices: most queries apply to a specific time range rather than to all historical data. In addition, data usually needs to be aggregated over a time window, across all devices or a subset of them. Filtering a subset of devices for aggregation is a must in IoT applications.
- Real-time computing or analysis is required: for most Internet applications, data analysis can be done offline. For example, a user profile is extracted from collected behavior data, but the data does not have to be analyzed right away; it can be processed in batches. For IoT applications, real-time analysis or computing is usually needed to monitor device status and raise alarms before a problem turns into business loss.
- Traffic is stable: given the number of devices and the sampling period, the data traffic can be calculated and predicted, and it stays fairly stable. For Internet applications, traffic is much harder to predict. For example, around 11/11 in China, traffic on taobao.com and jd.com can jump to 10 times its usual level.
- Special computing is needed: in many IoT scenarios, general-purpose computing on data is not enough. For example, a user wants to check the status of a device at a specific time point, but no data point exists at that time; the nearest data points were collected a few seconds earlier or later. The system needs to provide interpolation to solve this problem.
- Data volume is huge: although each record looks simple, the total volume is huge. In China, every smart power meter generates a data record every 15 minutes, or 96 records a day, and there are over 500 million smart meters in the country. Every connected car generates a data record every 10 to 15 seconds, so 200 million cars in China can generate over 200 billion records in a single day.
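Two of the operations named in the list, interpolation at a time point with no sample and aggregation over a time window for a subset of devices, can be illustrated with a minimal Python sketch. The function names, the `(timestamp, value)` record layout, and the sample values are assumptions made for illustration; linear interpolation is only one possible choice, and a real time-series database would implement these operations inside its query engine.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean


def interpolate(points, ts):
    """Linearly interpolate a value at time ts.

    points: list of (timestamp_seconds, value), sorted by timestamp.
    Returns None when ts lies outside the recorded range.
    """
    times = [t for t, _ in points]
    i = bisect_left(times, ts)
    if i < len(points) and times[i] == ts:
        return points[i][1]            # exact hit, nothing to interpolate
    if i == 0 or i == len(points):
        return None                    # no neighbor on one side
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)


def window_average(records, window_seconds, device_ids):
    """Average values per time window over a chosen subset of devices.

    records: iterable of (device_id, timestamp_seconds, value).
    Returns {window_start_seconds: average_value}.
    """
    buckets = defaultdict(list)
    for device_id, ts, value in records:
        if device_id in device_ids:    # filter the subset of devices first
            buckets[ts - ts % window_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical voltage samples from one meter, taken every 10 seconds.
voltage = [(0, 220.0), (10, 222.0), (20, 221.0)]
print(interpolate(voltage, 5))         # 221.0

# Hypothetical records from three meters; only m1 and m2 are aggregated.
records = [("m1", 0, 220.0), ("m2", 30, 224.0), ("m3", 45, 500.0), ("m1", 60, 221.0)]
print(window_average(records, 60, {"m1", "m2"}))   # {0: 222.0, 60: 221.0}
```

Both functions rely on records being keyed by timestamp, which is why the time-series and single-data-source properties make this kind of processing cheap.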
The data from connected devices is a stream. A single data record at a specific time point is not that important, and the analysis result is essentially the same even if some data points are lost. Processing such data sounds easy, but the sheer volume makes storing and analyzing it a big challenge. Traditional relational databases and NoSQL databases are not efficient enough to process data at this scale, so more and more computing and storage resources are required, and the overall data processing cost is very high.
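To get a feel for the scale, here is a back-of-the-envelope estimate in Python based on the device counts and sampling periods quoted above. The record size (an 8-byte timestamp plus one 4-byte float, with no metadata, indexes, or compression) and the assumption that cars report around the clock are illustrative assumptions, not figures from the text, so the car number is only an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_day(devices, interval_seconds, active_seconds=SECONDS_PER_DAY):
    """Records produced per day by `devices` devices that sample every
    `interval_seconds` while active for `active_seconds` each day."""
    return devices * (active_seconds // interval_seconds)


# Smart meters: 500 million meters, one record every 15 minutes (from the text).
meter_records = records_per_day(500_000_000, 15 * 60)

# Assumption: 8-byte timestamp + 4-byte float value per record.
BYTES_PER_RECORD = 8 + 4
meter_bytes = meter_records * BYTES_PER_RECORD

# Connected cars: 200 million cars, one record every 15 seconds (from the text).
# Assumption: every car reports 24 hours a day, so this is an upper bound.
car_records_upper = records_per_day(200_000_000, 15)

print(f"{meter_records:,}")            # 48,000,000,000
print(f"{meter_bytes / 1e9:.0f} GB")   # 576 GB
print(f"{car_records_upper:,}")        # 1,152,000,000,000
```

Even with a single metric per record, the meters alone would produce hundreds of gigabytes of raw data per day under these assumptions, before any indexing or replication overhead.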
Learn more about time-series databases: