- A TDengine cluster of five servers ingests 500 billion data points every day, adding up to 900 GB of data, without performance deterioration.
- Data is compressed to as little as 5% of its original size (a 20:1 ratio) for integer columns and 15% for text columns, significantly reducing storage costs.
- Specialized analytics applications are implemented directly in TDengine via user-defined functions.
A world-renowned earthquake research center had begun collecting a large volume of seismic waveform data from seismometers in the region. This data, characterized by typical time-series features, would be used for earthquake monitoring, early warning systems, data analysis, anomaly assessment, and other applications. The center decided to deploy a new database management system to store and process this data and defined the following requirements for the system:
- Efficient ingestion and querying, capable of handling 5 million data points per second and supporting rapid ad-hoc queries.
- High data compression ratio, storing as much data as possible while minimizing disk storage costs.
- Time-series analysis with time-based windowing functions or tools.
- Elastic scalability to accommodate increasing data volumes as the project expands.
- High security and reliability with multi-replica storage and high availability to prevent single points of failure.
- Data archiving, separating real-time data from historical data based on access frequency.
- User-defined functions for integration of specialized seismic signal processing algorithms.
After consideration, the center selected TDengine as its time-series database.
System Architecture
This project uses a 5-node TDengine cluster in which each server is configured with 48 cores, 192 GB of RAM, a 500 GB SSD, and six 1.2 TB HDDs.
The system architecture is illustrated in the following figure.
Each seismometer typically collects seismic motion data through three channels (two horizontal and one vertical component) with a sampling frequency of 100 Hz, or one sample every 10 milliseconds. After the data in each packet is parsed using proprietary tools, it is written into TDengine. While data is being written, TDengine’s subscription feature extracts each data packet, further parses it, and stores additional layers of data in the TDengine cluster.
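For illustration, the subscription side might be set up with TDengine's data subscription (TMQ) feature along the following lines. The topic name, supertable name, and selected columns here are assumptions for demonstration, not the center's actual configuration:

```sql
-- Create a topic over the incoming waveform data so that a consumer
-- process can receive each packet as it is written
-- (topic and supertable names are assumed for this sketch).
CREATE TOPIC IF NOT EXISTS seismic_packets AS
  SELECT ts, data, network, station, location, channel
  FROM data;
```

A consumer process subscribed to such a topic receives each packet as it arrives, parses it further, and writes the derived metadata back into the cluster.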
The location of each station is identified by its network code, station code, and sensor position number. The seismometers collect seismic motion data 24 hours a day, packaging data into miniSEED packets every 0.5 seconds to create a data stream. Each of these data streams corresponds to a data table and a metadata table.
The data tables store the waveform samples themselves:

| Field | Type | Notes |
| --- | --- | --- |
| ts | TIMESTAMP | Timestamp of seismic data point |
| data | INT | Amplitude |
| network | BINARY(20) | Tag: network code |
| station | BINARY(20) | Tag: station code |
| location | BINARY(20) | Tag: sensor position number |
| channel | BINARY(20) | Tag: channel code |
The metadata tables describe each miniSEED packet in the stream:

| Field | Type | Notes |
| --- | --- | --- |
| ingesttime | TIMESTAMP | Ingestion time |
| delay | BIGINT | Ingestion delay |
| starttime | TIMESTAMP | Start time of miniSEED packet |
| endtime | TIMESTAMP | End time of miniSEED packet |
| npts | INT | Number of samples in miniSEED packet |
| samprate | INT | Sampling rate of miniSEED packet |
| network | BINARY(20) | Tag: network code |
| station | BINARY(20) | Tag: station code |
| location | BINARY(20) | Tag: sensor position number |
| channel | BINARY(20) | Tag: channel code |
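As a minimal sketch, the two supertables behind these schemas might be declared as follows. The supertable names (`data`, matching the stream example later in this article, and `packets`) are assumptions for illustration:

```sql
-- Supertable for waveform samples: one subtable per data stream.
CREATE STABLE IF NOT EXISTS data (
  ts   TIMESTAMP,
  data INT
) TAGS (
  network  BINARY(20),
  station  BINARY(20),
  location BINARY(20),
  channel  BINARY(20)
);

-- Supertable for per-packet metadata (name assumed).
CREATE STABLE IF NOT EXISTS packets (
  ingesttime TIMESTAMP,
  delay      BIGINT,
  starttime  TIMESTAMP,
  endtime    TIMESTAMP,
  npts       INT,
  samprate   INT
) TAGS (
  network  BINARY(20),
  station  BINARY(20),
  location BINARY(20),
  channel  BINARY(20)
);
```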
At present, the cluster contains approximately 59,000 data tables, each representing a seismic data stream flowing into the database, and another 59,000 metadata tables that store descriptive information for each data packet within the streams.
Since the project began, TDengine has ingested more than 50,000 data packets per second, totaling about 500 billion data points daily, or 900 GB. Integer columns are compressed to 5% to 10% of their original size, and string columns to 15% to 20%, significantly reducing storage costs.
Each database server shows CPU usage of 40% to 50% and memory usage of 14% to 20%, indicating stable operation with headroom on the current hardware.
Applications
The earthquake research center now queries TDengine for critical tasks such as finding amplitude data for specific instruments and identifying data packets that have high latency. In addition, a pre-trained phase detection model, integrated into TDengine as a user-defined function, is used for real-time phase picking by means of TDengine’s built-in stream processing.
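The queries behind these tasks might look like the following sketch. The metadata supertable name (`packets`), the tag values, the 500 ms delay threshold, and the UDF name and library path are illustrative assumptions rather than the center's actual identifiers:

```sql
-- Amplitude data for one instrument over the last hour,
-- filtered by its identifying tags (tag values assumed).
SELECT ts, data FROM data
WHERE network = 'XX' AND station = 'STA01'
  AND location = '00' AND channel = 'BHZ'
  AND ts > NOW - 1h;

-- Data packets whose ingestion delay exceeds a threshold
-- (metadata supertable name and threshold assumed).
SELECT ingesttime, delay, network, station
FROM packets
WHERE delay > 500
ORDER BY delay DESC;

-- Registering a pre-trained phase detection model as a UDF
-- (function name and library path are placeholders).
CREATE FUNCTION phase_pick AS '/usr/local/taos/udf/libphasepick.so' OUTPUTTYPE INT;
```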
As another example of the benefits of stream processing, the center calculates the maximum peak value per second across thousands of seismometers in real time as follows:
```sql
-- Compute the per-second peak amplitude of each vertical-component
-- stream, writing results to one subtable per source stream.
CREATE STREAM xxxxxxx
INTO max_val_per_seconds_test
SUBTABLE(CONCAT('max_test_val_per_seconds-', tbname))
AS SELECT ts, MAX(data) AS max_data
FROM data
WHERE channel_3th = 'Z'   -- 'Z' denotes the vertical component
PARTITION BY tbname       -- one result subtable per seismometer stream
INTERVAL(1s)
SLIDING(1s);              -- tumbling 1-second windows
```
Queries against this stream return results in under 1 second, making real-time analysis an actual possibility instead of just a buzzword.
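For reference, reading the most recent per-second maxima from the stream's destination supertable might look like this; the table name comes from the INTO clause above, and the 1-minute window is an arbitrary example:

```sql
-- Most recent per-second peak values across all monitored stations.
SELECT ts, max_data
FROM max_val_per_seconds_test
WHERE ts > NOW - 1m;
```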
Additionally, TDengine integrates seamlessly with Grafana for data visualization. For example, the following dashboard provides a visual representation of the seismic activity at a certain location over a specified time period.
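As a sketch, a Grafana panel backed by the TDengine data source might issue a query like the following, where `$from` and `$to` stand for the dashboard's time-range placeholders and the tag values are hypothetical:

```sql
-- Per-second peak amplitude for one station's vertical channel,
-- bounded by the dashboard's selected time range.
SELECT _wstart, MAX(data) AS peak
FROM data
WHERE station = 'STA01' AND channel = 'BHZ'
  AND ts >= $from AND ts < $to
INTERVAL(1s);
```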
Benefits of TDengine
Before TDengine was deployed, miniSEED files were segmented, processed, and analyzed with custom-built Python and C programs. TDengine has streamlined and optimized this workflow as follows:
- MiniSEED files are now parsed into time-series data and stored in separate tables in TDengine, eliminating the time required to unpack data.
- Real-time querying, visualization, and analysis are now performed with SQL statements, reducing the technical complexity for operators.
- The scalability of the system is significantly enhanced, allowing for future expansion to handle potential increases in data volume.
- New functionality and machine learning models can be integrated into TDengine as UDFs when needed for future projects.
Conclusion
Seismic data is inherently time-based, and large-scale applications can benefit greatly from a time-series database. TDengine’s high performance and flexibility, in particular its native support for user-defined functions and stream processing, make it a powerful, adaptable, and cost-effective solution for seismological data management and analytics, helping researchers and monitoring centers efficiently handle, analyze, and visualize seismic data.