Anomaly Detection in IT Monitoring with TDgpt

Joel Brass

March 25, 2025

In server operations and maintenance, keeping a close eye on key metrics like CPU, memory, disk, and network is much like having a regular checkup at the doctor’s. Continuous monitoring of these indicators not only provides insights into system performance but also helps detect potential issues early using intelligent analysis tools.

For example, a sudden spike in CPU usage might point to a code bug introduced after a software update, a hidden crypto-mining virus, or even signs of hardware degradation. Some fluctuations are normal—such as regular spikes during scheduled tasks—but unexpected and sustained high loads should be investigated and resolved immediately.

Traditional monitoring methods rely on manually set fixed thresholds, which often leads to false alarms or missed issues. Today, by analyzing historical data to establish dynamic baselines, systems can automatically detect genuine anomalies. For instance, after a version update, the algorithm might detect that a service’s CPU usage is 30% higher than usual and immediately trigger an alert—something that might previously have been dismissed as normal business growth.
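
As a rough illustration of the difference, compare a hard-coded threshold with a baseline derived from recent history (a generic SQL sketch with hypothetical table and column names; exact syntax varies by database):

    -- Fixed threshold: fires on any reading above a hard-coded value
    SELECT ts, cpu FROM server_metrics WHERE cpu > 90;

    -- Dynamic baseline (the k-sigma idea): flag readings more than three
    -- standard deviations above the trailing week's mean
    SELECT ts, cpu FROM server_metrics
    WHERE cpu > (SELECT AVG(cpu) + 3 * STDDEV(cpu)
                 FROM server_metrics
                 WHERE ts >= NOW() - 7d);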

The benefits of intelligent monitoring are tangible: it prevents unnecessary alerts that wake you up in the middle of the night, while still catching serious issues that could lead to server outages. By comparing normal and abnormal server behavior, the system acts like a seasoned operations engineer—accurately distinguishing between software bugs, security breaches, and hardware failures, and providing clear guidance for follow-up actions.

This article provides a guide for quickly setting up a TDgpt test environment using Docker Compose and demonstrates the full process of performing anomaly detection in an operations monitoring scenario using real-world data.

What Is TDgpt?

TDgpt is an intelligent agent built into TDengine for time-series data analysis. Leveraging TDengine’s time-series query capabilities, it provides advanced analysis functions—such as time-series forecasting and anomaly detection—through SQL, with the ability to dynamically extend and switch analysis tasks at runtime. By integrating prebuilt time-series foundation models, large language models, machine learning, and traditional algorithms, TDgpt enables engineers to deploy time-series forecasting and anomaly detection models within 10 minutes. This reduces the development and maintenance costs of time-series analysis models by at least 80%.
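
These capabilities are invoked directly in SQL. As a minimal sketch of the anomaly detection syntax used later in this article (the table and column names here are placeholders):

    -- Partition a metric into anomaly windows using the k-sigma algorithm
    -- and report the average value within each detected window
    SELECT _wstart, AVG(val)
    FROM my_metrics
    ANOMALY_WINDOW(val, 'algo=ksigma,k=3');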

Learn more about TDgpt.

Preparing the Demo Dataset

The data in this demo comes from the public NAB dataset and represents CPU usage from a server during an API gateway failure at Amazon’s AWS East Coast data center. The data has a 5-minute sampling frequency and is measured in CPU utilization percentage. Due to the API gateway failure, applications on the affected server entered a cycle of frequent error handling and retries, leading to abnormal CPU usage fluctuations. TDgpt’s anomaly detection algorithms are designed to identify such anomalies accurately.

This data file is located in the demo_data directory of the TDgpt-demo repository. The following procedure shows how to import the data into TDengine and complete the demonstration. The dataset is described as follows:

Dimension     Value
Records       4021
Time range    2014-03-07 03:41:00 to 2014-03-21 03:41:00
Maximum       99.25 (% CPU)
Minimum       22.86 (% CPU)
Average       45.16 (% CPU)

Preparing the Demo Environment

Prerequisites

The demo is run in Docker and does not require a specific operating system. However, the following tools are required to run the Docker Compose setup; a quick way to verify them is shown after this list:

  1. Git
  2. Docker Engine: v20.10+
  3. Docker Compose: v2.20+
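
You can confirm the installed versions with the following commands (depending on how Compose was installed, the last command may be docker-compose --version instead):

    git --version
    docker --version
    docker compose version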

The demo contains three Docker containers (TDengine, TDgpt, and Grafana) and shell scripts that generate forecasting or anomaly detection results.
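
For reference, the environment's layout looks roughly like the following (an illustrative sketch only; the docker-compose.yml in the repository is authoritative, and the image names here are assumptions):

    # Sketch of the three-service layout; not the actual compose file
    services:
      tdengine:
        image: tdengine/tdengine     # TDengine server (assumed image)
        container_name: tdengine
      tdgpt:
        image: tdengine/tdgpt        # TDgpt anode (assumed image)
        container_name: tdgpt
      grafana:
        image: grafana/grafana
        container_name: grafana
        ports:
          - "3000:3000"              # Grafana UI used later in the demo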

Procedure

  1. Clone the demo repository and make the script file executable:

    git clone https://github.com/taosdata/TDgpt-demo
    cd TDgpt-demo
    chmod 775 analyse.sh
  2. Navigate to the directory containing the docker-compose.yml file and run the following command to start the integrated demo environment with TDengine, TDgpt, and Grafana:

    docker-compose up -d
  3. Wait 10 seconds and then register the anode with TDengine:

    docker exec -it tdengine taos -s "create anode 'tdgpt:6090'"
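
    To confirm the registration, you can list the anodes known to TDengine (the show anodes statement comes from the TDgpt documentation):

    docker exec -it tdengine taos -s "show anodes"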
  4. Initialize the data for the test environment:

    docker cp analyse.sh tdengine:/var/lib/taos
    docker cp demo_data tdengine:/var/lib/taos
    docker exec -it tdengine taos -s "source /var/lib/taos/demo_data/init_ec2_failure.sql"
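
    To verify the import, compare the table's statistics with the dataset description above (the tdgpt_demo database and ec2_failure table are created by the init script and used by analyse.sh later):

    docker exec -it tdengine taos -s "select count(*), min(val), max(val), avg(val) from tdgpt_demo.ec2_failure"

    You should see 4021 records with values between 22.86 and 99.25.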

Your demo environment has now been created. To remove the environment if it is no longer needed, run the following command:

docker-compose down

Running the Demo

  1. Open your browser and go to http://localhost:3000, then log in to Grafana using the default username and password: admin / admin.

  2. After logging in successfully, navigate to “Home → Dashboards”, then import the ec2_failure_anomaly.json file to load the preconfigured dashboard.

  3. After importing, select the “ec2_failure_anomaly” dashboard. The dashboard is already configured to display the actual values along with detection results from the k-sigma and Grubbs algorithms. At this point, only the data curve for the actual values is visible.

  4. Run analyse.sh to begin anomaly detection. First try the k-sigma algorithm:

    docker exec -it tdengine /var/lib/taos/analyse.sh --type anomaly --db tdgpt_demo --table ec2_failure --stable single_val --algorithm ksigma --params "k=3" --start "2014-03-07" --window 7d --step 1h

    The command above starts from the specified timestamp (2014-03-07) and uses a 7-day sliding window as input to perform anomaly detection on the ec2_failure table with the k-sigma algorithm. It repeats this process until it reaches the last record in the table. The results are written to the ec2_failure_ksigma_result table.

    Before each detection run, the script creates the results table, dropping any results from a previous run. During execution, the console continuously displays output in hourly increments, showing results like the following:

    Processing window: 2014-03-07 02:00:00 → 2014-03-14 02:00:00
    Welcome to the TDengine Command Line Interface, Client Version:3.3.6.0
    Copyright (c) 2023 by TDengine, all rights reserved.
    
    taos> INSERT INTO ec2_failure_ksigma_result
                      SELECT _wstart, avg(val)         
                      FROM ec2_failure
                      WHERE ts >= '2014-03-07 02:00:00' AND ts < '2014-03-14 02:00:00'
                      ANOMALY_WINDOW(val, 'algo=ksigma,k=3')
    Insert OK, 10 row(s) affected (0.326801s)

    Here, a 1-hour step size (--step) is used to accelerate the dynamic detection process for demonstration purposes. However, --step can also be set to a finer granularity, such as 5 minutes, to produce more real-time detection results. In actual application scenarios, users are encouraged to adjust this parameter based on data resolution, real-time detection requirements, and available computing resources.
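
    For example, a hypothetical rerun at 5-minute granularity (this assumes the script accepts minute-level values for --step):

    docker exec -it tdengine /var/lib/taos/analyse.sh --type anomaly --db tdgpt_demo --table ec2_failure --stable single_val --algorithm ksigma --params "k=3" --start "2014-03-07" --window 7d --step 5m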

  5. In the Grafana dashboard, set the refresh interval to 5 seconds to dynamically display the yellow curve representing the detection results. This allows for a clear visual comparison with the actual values. For better clarity, hold Cmd (on macOS) or Ctrl (on Windows) and click the “Real” and “ksigma” legend entries in the bottom left corner to display only these two curves.

  6. Now run detection again using Grubbs’ test:

    docker exec -it tdengine /var/lib/taos/analyse.sh --type anomaly --db tdgpt_demo --table ec2_failure --stable single_val --algorithm grubbs --start "2014-03-07" --window 7d --step 1h

From the detection results, we can see that Grubbs’ test produces fewer false positives than the k-sigma algorithm with its default parameter of k=3.
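
To put a rough number on the comparison, count the anomaly windows each run recorded (this assumes the Grubbs results follow the same table naming convention as the k-sigma results):

    docker exec -it tdengine taos -s "select count(*) from tdgpt_demo.ec2_failure_ksigma_result"
    docker exec -it tdengine taos -s "select count(*) from tdgpt_demo.ec2_failure_grubbs_result"

Because the sliding windows overlap, the same anomaly may be counted more than once; treat these counts as a rough comparison rather than an exact error rate.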

By selecting a specific time range with the mouse, we can zoom in and examine a fine-grained comparison between the detection results and the actual values over that period.

You can also experiment with other algorithms or models to find the one that best fits your specific scenario.
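
To see which algorithms your anode provides before experimenting, you can query it from TDengine (show anodes full is the statement documented for TDgpt):

    docker exec -it tdengine taos -s "show anodes full"

Then pass the algorithm's name to the --algorithm argument of analyse.sh, along with any parameters in --params.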

Conclusion

In this article, we demonstrated the process of using TDgpt for anomaly detection in IT monitoring. As shown, TDgpt integrates time-series analysis seamlessly into applications through SQL, greatly reducing the cost of developing and deploying time-series forecasting and anomaly detection solutions.

Joel Brass

Joel Brass is a Solutions Architect at TDengine, bringing extensive experience in real-time data processing, time-series analytics, and full-stack development. With a 20-year background in software engineering and a deep focus on scalable applications and solutions, Joel has worked on a range of projects spanning joke databases, IoT, self-driving vehicles, and work management platforms. Prior to joining TDengine, Joel worked in Advisory Services for enterprise customers of Atlassian and on the Systems Engineering team at Waymo. He is currently based in the San Francisco Bay Area.