- A TDengine cluster of five servers ingests 500 billion data points every day, adding up to 900 GB of data, without performance deterioration.
- Data is compressed to as little as 5% of its original size (a 20:1 ratio) for integer columns and 15% for text columns, significantly reducing storage costs.
- Specialized analytics applications are implemented directly in TDengine via user-defined functions.
A world-renowned earthquake research center had begun collecting a large volume of seismic waveform data from seismometers in the region. This data, characterized by typical time-series features, would be used for earthquake monitoring, early warning systems, data analysis, anomaly assessment, and other applications. The center decided to deploy a new database management system to store and process this data and defined the following requirements for the system:
- Efficient ingestion and querying, capable of handling 5 million data points per second and supporting rapid ad-hoc queries.
- High data compression ratio, storing as much data as possible while minimizing disk storage costs.
- Time-series analysis with time-based windowing functions or tools.
- Elastic scalability to accommodate increasing data volumes as the project expands.
- High security and reliability with multi-replica storage and high availability to prevent single points of failure.
- Data archiving, separating real-time data from historical data based on access frequency.
- User-defined functions for integration of specialized seismic signal processing algorithms.
After consideration, the center selected TDengine as its time-series database.
System Architecture
This project uses a 5-node TDengine cluster in which each server is configured with 48 cores, 192 GB of RAM, a 500 GB SSD, and six 1.2 TB HDDs.
The system architecture is illustrated in the following figure.
Each seismometer typically collects seismic motion data through three channels (two horizontal and one vertical component) with a sampling frequency of 100 Hz, or one sample every 10 milliseconds. After the data in each packet is parsed using proprietary tools, it is written into TDengine. While data is being written, TDengine’s subscription feature extracts each data packet, further parses it, and stores additional layers of data in the TDengine cluster.
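For illustration, the subscription side might be set up with TDengine's data subscription (TMQ) feature along the following lines. The topic name, supertable name, and selected columns here are assumptions for demonstration, not the center's actual configuration:

```sql
-- Create a topic over the incoming waveform data so that a consumer
-- process can receive each packet as it is written
-- (topic and supertable names are assumed for this sketch).
CREATE TOPIC IF NOT EXISTS seismic_packets AS
  SELECT ts, data, network, station, location, channel
  FROM data;
```

A consumer process subscribed to such a topic receives each packet as it arrives, parses it further, and writes the derived metadata back into the cluster.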
The location of each station is identified by its network code, station code, and sensor position number. The seismometers collect seismic motion data 24 hours a day, packaging data into miniSEED packets every 0.5 seconds to create a data stream. Each of these data streams corresponds to a data table and a metadata table.
The data tables store the waveform samples themselves:

| Field | Type | Notes |
| --- | --- | --- |
| ts | TIMESTAMP | Timestamp of seismic data point |
| data | INT | Amplitude |
| network | BINARY(20) | Tag: network code |
| station | BINARY(20) | Tag: station code |
| location | BINARY(20) | Tag: sensor position number |
| channel | BINARY(20) | Tag: channel code |
The metadata tables describe each miniSEED packet in the stream:

| Field | Type | Notes |
| --- | --- | --- |
| ingesttime | TIMESTAMP | Ingestion time |
| delay | BIGINT | Ingestion delay |
| starttime | TIMESTAMP | Start time of miniSEED packet |
| endtime | TIMESTAMP | End time of miniSEED packet |
| npts | INT | Number of samples in miniSEED packet |
| samprate | INT | Sampling rate of miniSEED packet |
| network | BINARY(20) | Tag: network code |
| station | BINARY(20) | Tag: station code |
| location | BINARY(20) | Tag: sensor position number |
| channel | BINARY(20) | Tag: channel code |
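As a minimal sketch, the two supertables behind these schemas might be declared as follows. The supertable names (`data`, matching the stream example later in this article, and `packets`) are assumptions for illustration:

```sql
-- Supertable for waveform samples: one subtable per data stream.
CREATE STABLE IF NOT EXISTS data (
  ts   TIMESTAMP,
  data INT
) TAGS (
  network  BINARY(20),
  station  BINARY(20),
  location BINARY(20),
  channel  BINARY(20)
);

-- Supertable for per-packet metadata (name assumed).
CREATE STABLE IF NOT EXISTS packets (
  ingesttime TIMESTAMP,
  delay      BIGINT,
  starttime  TIMESTAMP,
  endtime    TIMESTAMP,
  npts       INT,
  samprate   INT
) TAGS (
  network  BINARY(20),
  station  BINARY(20),
  location BINARY(20),
  channel  BINARY(20)
);
```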
At present, the cluster contains approximately 59,000 data tables, each representing a seismic data stream flowing into the database, and another 59,000 metadata tables that store descriptive information for each data packet within the streams.
Since the project began, TDengine has ingested more than 50,000 data packets per second, totaling about 500 billion data points daily, or 900 GB. Integer columns are compressed to 5% to 10% of their original size, and string columns to 15% to 20%, significantly reducing storage costs.
Each database server shows CPU usage of 40% to 50% and memory usage of 14% to 20%, indicating stable operation with headroom on the current hardware.
Applications
The earthquake research center now queries TDengine for critical tasks such as finding amplitude data for specific instruments and identifying data packets that have high latency. In addition, a pre-trained phase detection model, integrated into TDengine as a user-defined function, is used for real-time phase picking by means of TDengine’s built-in stream processing.
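The queries behind these tasks might look like the following sketch. The metadata supertable name (`packets`), the tag values, the 500 ms delay threshold, and the UDF name and library path are illustrative assumptions rather than the center's actual identifiers:

```sql
-- Amplitude data for one instrument over the last hour,
-- filtered by its identifying tags (tag values assumed).
SELECT ts, data FROM data
WHERE network = 'XX' AND station = 'STA01'
  AND location = '00' AND channel = 'BHZ'
  AND ts > NOW - 1h;

-- Data packets whose ingestion delay exceeds a threshold
-- (metadata supertable name and threshold assumed).
SELECT ingesttime, delay, network, station
FROM packets
WHERE delay > 500
ORDER BY delay DESC;

-- Registering a pre-trained phase detection model as a UDF
-- (function name and library path are placeholders).
CREATE FUNCTION phase_pick AS '/usr/local/taos/udf/libphasepick.so' OUTPUTTYPE INT;
```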
As another example of the benefits of stream processing, the center calculates the maximum peak value per second across thousands of seismometers in real time as follows:
```sql
-- Compute the per-second peak amplitude of each vertical-component
-- stream, writing results to one subtable per source stream.
CREATE STREAM xxxxxxx
INTO max_val_per_seconds_test
SUBTABLE(CONCAT('max_test_val_per_seconds-', tbname))
AS SELECT ts, MAX(data) AS max_data
FROM data
WHERE channel_3th = 'Z'   -- 'Z' denotes the vertical component
PARTITION BY tbname       -- one result subtable per seismometer stream
INTERVAL(1s)
SLIDING(1s);              -- tumbling 1-second windows
```
Queries against this stream return results in under 1 second, making real-time analysis an actual possibility instead of just a buzzword.
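For reference, reading the most recent per-second maxima from the stream's destination supertable might look like this; the table name comes from the INTO clause above, and the 1-minute window is an arbitrary example:

```sql
-- Most recent per-second peak values across all monitored stations.
SELECT ts, max_data
FROM max_val_per_seconds_test
WHERE ts > NOW - 1m;
```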
Additionally, TDengine integrates seamlessly with Grafana for data visualization. For example, the following dashboard provides a visual representation of the seismic activity at a certain location over a specified time period.
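As a sketch, a Grafana panel backed by the TDengine data source might issue a query like the following, where `$from` and `$to` stand for the dashboard's time-range placeholders and the tag values are hypothetical:

```sql
-- Per-second peak amplitude for one station's vertical channel,
-- bounded by the dashboard's selected time range.
SELECT _wstart, MAX(data) AS peak
FROM data
WHERE station = 'STA01' AND channel = 'BHZ'
  AND ts >= $from AND ts < $to
INTERVAL(1s);
```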
Benefits of TDengine
Before TDengine was deployed, miniSEED files were segmented, processed, and analyzed with custom-built Python and C programs. TDengine has streamlined and optimized this workflow as follows:
- MiniSEED files are now parsed into time-series data and stored in separate tables in TDengine, eliminating the time required to unpack data.
- Real-time querying, visualization, and analysis are now performed with SQL statements, reducing the technical complexity for operators.
- The scalability of the system is significantly enhanced, allowing for future expansion to handle potential increases in data volume.
- New functionality and machine learning models can be integrated into TDengine as UDFs when needed for future projects.
Conclusion
Seismic data is inherently time-based, and large-scale applications can benefit greatly from a time-series database. TDengine’s high performance and flexibility, in particular its native support for user-defined functions and stream processing, make it a powerful, adaptable, and cost-effective solution for seismological data management and analytics, helping researchers and monitoring centers efficiently handle, analyze, and visualize seismic data.