Environmental monitoring and data governance at scale

GESRI created a project that would help enterprises use big data technology to manage their environmental management and governance goals and comply with environmental regulations. Instead of leaving enterprises to create their own platform to solve the issues of data governance, quality management and associated tasks like data cleaning, ETL, aggregation and so on, GESRI’s platform would provide the basis for accurate governance and intelligent decision making.

The real-time data warehouse structure of this project is shown in the figure below. The status of environmental quality and real-time pollution discharge, in air, surface water, coastal waters, fixed and mobile pollution sources, can be captured in real-time. The real-time monitoring data can be further analyzed for early warning of possible harmful environmental events.

Establishing a real-time data warehouse of environmental data and exposing centralized collection, aggregation and analysis services, is more efficient than having various organizations build the same software platforms. This promotes software reuse and also promotes best practices and standards.

From a relational database to TDengine

Since this is a real-time data warehouse, the database solution required for it has much higher performance requirements than those of a general database application. The storage layers must provide extremely high write and read performance with very low latency. Due to the vast amounts of data, it must also provide very efficient and compact storage.

In our case, there are nearly 5000 monitoring stations and each station monitors 3-5 factors of key pollutant discharge units every 30 seconds. For many years, we used a relational databases for data storage and were only able to keep data for 3-5 days and had to delete old data on a daily basis. We considered using PostgreSQL’s TimescaleDB extension, but it could not meet certain regulations having to do with monitoring and control of individual devices. We also tried to use another time-series database (TSDB) product, but its performance and general interface were still far from our requirements.

When we first came into contact with TDengine, we were a little hesitant because the product was so new. Fortunately, TDengine is easy to install and has complete documentation. Even though it is an open source product, product support is very comprehensive. There is also a dedicated technical group on social media. When problems arose, we were always able to get help from TDengine support staff and the TDengine community. Our entire evaluation process was very smooth. Finally, after a long period of research and testing, we chose the open source time series database, TDengine.

Deployment and performance of TDengine

For the deployment of TDengine, we chose three, 4-core servers with 8G RAM each, one 8-core server with 16G RAM and one 8-core server with 32G RAM for a total of 5 servers. These servers were configured as a 5-node single-copy cluster. There are three business lines that have been put into operation. These include the following:

  • Automatic monitoring system of key pollutant discharge units every 30 seconds.
  • Real-time data reporting of OBD for heavy-duty diesel vehicles
  • Real-time monitoring data reporting of VOCs polluting enterprises.

We also perform automatic monitoring of data reporting services, as well as data analysis and data application services.

From April 2021 to the present (January 2022), the project has created 4 supertables and 67,453 subtables.

Dnodes in the TDengine cluster
Cluster status
Table schema for the project

To illustrate the storage process, we take the 30-second monitoring data of exhaust gas emissions from key polluting enterprises as an example.

The st_om_rtd_gas table is a super table we created for waste gas treatment. It stores 7.65 billion pieces of data (a total of 16 billion data in the four super tables), scattered in 19,419 sub-tables, with an average of 390,000 rows per table. Due to the characteristics of TDengine super table, coupled with columnar storage and ultra-high compression capability, these data only occupy 240G of storage.  Not only does this significantly reduce our storage space and costs, but also lays a good foundation for data query performance.

In terms of queries, our main logic is to obtain the recent maximum series value, filter and obtain a list of monitoring factors that have exceeded limits, and then query the hourly, daily and monthly maxima through downsampling. We can then query specific discharge outlets to determine the list of discharge outlets with high pollutant discharge concentration or flow rate for a period of time, and then analyze each discharge concentration within a specific time range to obtain detailed information on abnormal discharges. TDengine has successfully completed the above query analysis process with excellent performance.

Here are examples of specific queries:

SELECT top(val_zs, 20) FROM st_om_rtd_gas
WHERE moni_time >= NOW - 2h
GROUP BY pol_code;
Latency of the previous query

For the 7.6 billion row hypertable, the grouped TOP query took only 0.2 seconds.

SELECT MAX(val_zs) FROM st_om_rtd_gas
WHERE moni_time >= NOW -2h AND pol_code in ('002’,'‘010')
INTERVAL(1h) GROUP BY ps_code,p_code,pol_code;
Latency of the previous query

Based on TDengine returning 2,968 rows, it took only 0.06 seconds.

Now we extract the list of suspicious discharge outlets from the returned results to check the specific discharge situation:

SELECT val, val_zs, ps_name, p_name, pol_name, strength_unit FROM st_om_rtd_gas
WHERE ps_code = 440000000168 AND p_code = 1101 AND moni_time >= '2021-12-27 00:00:00.000';
Latency of the previous query

Returning 5,280 rows of data took only 0.1 seconds, and the performance completely exceeded our expectations.

It is worth mentioning that we are currently using virtual machines in our own data center and the performance of TDengine does not falter at all. In the near future, we plan to migrate to a cloud platform with better resources, and we believe that the performance of TDengine will increase.

Next step: implementing edge layer construction based on TDengine

After using TDengine for several months, it has far exceeded expectations in terms of CPU load, data storage efficiency, data collection efficiency, and data query efficiency, meeting all the requirements of our real-time data warehouse. TDengine has strengthened our process of ecological and environmental data governance and has allowed accurate and timely detection of anomalous pollution discharge. TDengine has great potential in pollution prevention and control scenarios.

As our next step in fully exploiting the features of TDengine we plan to use some of it’s features to enhance the edge layer of our platform. Specifically, we plan to use the subscription function of TDengine to capture data from the edge and also process it immediately on the edge side to enhance real-time pollution prevention and control.

I am looking forward to newer releases of TDengine and hoping for a bigger and bigger TDengine community in the coming years.

About the author:

Shen Hualie, system analyst, has nearly 20 years of experience in environmental information science. He is currently working at the Environmental Information Research Institute of Guangdong Academy of Environmental Sciences.