In addition to AI and ML, TDgpt offers a range of traditional statistical algorithms for analyzing time-series data. These are useful for datasets on which existing AI models perform poorly, in environments with limited compute resources, and in other scenarios where enterprises are not ready to implement AI. And if your organization has developed custom algorithms, you can import them into TDgpt and use them in SQL statements.
This article discusses the algorithms included with TDgpt and their use cases, along with usage examples.
Time-Series Forecasting Algorithms
Holt–Winters
The Holt–Winters algorithm, also known as triple exponential smoothing, is a popular time-series forecasting method that accounts for level (the average value in the series), trend (increasing or decreasing pattern), and seasonality (repeating patterns over fixed periods).
TDgpt uses additive Holt–Winters if seasonal variation remains mostly consistent within a time series and multiplicative Holt–Winters if seasonal variation is proportional to the level of the time series.
Holt–Winters is suitable when your dataset displays trend and seasonal variations and you want to forecast future values in the short term with a relatively simple, interpretable model.
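For example, the following statement requests an additive Holt–Winters forecast. The trend, seasonal, and period parameter names are assumptions based on common TDgpt options, so verify them against the documentation for your version:
-- Forecast with additive trend and seasonality over an assumed period of 10 intervals
SELECT _frowts, FORECAST(<column-name>, "algo=holtwinters,trend=add,seasonal=add,period=10") FROM <table-name>;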
ARIMA
Autoregressive integrated moving average (ARIMA) is used for univariate time-series forecasting on datasets without strong seasonality. ARIMA makes the data stationary by differencing, fits a linear model that combines past values and past forecast errors, and then forecasts future values from this model.
ARIMA is suitable for non-seasonal time series that are stationary or can be made stationary by differencing, and it requires data teams to be comfortable tuning its p (autoregressive order), d (degree of differencing), and q (moving-average order) parameters.
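As a sketch, the following statement requests an ARIMA forecast instead. The algorithm identifier arima and the order search parameters start_p, max_p, start_q, and max_q are assumptions based on TDgpt's documentation:
-- Let TDgpt search for suitable AR (p) and MA (q) orders within the given ranges
SELECT _frowts, FORECAST(<column-name>, "algo=arima,start_p=1,max_p=5,start_q=1,max_q=5") FROM <table-name>;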
Anomaly Detection Algorithms
| | IQR | k-sigma | S-H-ESD | Grubbs | LOF |
|---|---|---|---|---|---|
| Type | Statistical | Statistical | Statistical | Statistical | Density-based |
| Requires normal distribution | No | Yes | No | Yes | No |
| Handles trends and seasonality | No | No | Yes | No | No |
| Requires significant compute resources | No | No | Yes | No | Yes |
| Handles real-time data well | No | Yes | No | No | No |

Comparison of anomaly detection algorithms
IQR
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. Once you define boundaries based on the IQR, commonly 1.5 × IQR below Q1 and 1.5 × IQR above Q3, any data point outside that range is considered a potential anomaly.
IQR is easy to use, good for batch or window-based time series, and does not assume any specific distribution of data. However, it does not account for trends or seasonality and is not typically suitable for real-time or streaming data. For its usage in TDgpt, see the anomaly detection example at the end of this article.
k-sigma
The k-sigma method is a classic approach to anomaly detection for normally distributed data. In this method, the mean and standard deviation of the dataset are calculated. Data points that deviate from the mean by more than k standard deviations are then considered anomalies.
k-sigma is simple and fast, especially for data in a normal distribution. That said, it is sensitive to skew from outliers and also does not handle seasonal or non-stationary data without modification.
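For example, the following statement treats points more than two standard deviations from the mean as anomalies. The k parameter name is an assumption based on TDgpt's documented k-sigma options:
-- Flag values that deviate from the mean by more than k = 2 standard deviations
SELECT _wstart, COUNT(*) FROM <table-name> ANOMALY_WINDOW(<column-name>, "algo=ksigma,k=2");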
S-H-ESD
Seasonal Hybrid ESD (S-H-ESD) decomposes a time series into trend, seasonality, and residuals, meaning what's left over after removing trend and seasonality. It then applies the extreme studentized deviate (ESD) test to the residuals and identifies values that are statistically distant from the expected distribution, based on a significance level.
S-H-ESD is designed to handle seasonality and is more accurate than IQR or k-sigma, but cannot handle real-time data and is compute-intensive.
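A minimal sketch of its usage, assuming TDgpt exposes this algorithm under the identifier shesd:
-- Decompose the series and apply the ESD test to the residuals
SELECT _wstart, COUNT(*) FROM <table-name> ANOMALY_WINDOW(<column-name>, "algo=shesd");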
Grubbs’s Test
Grubbs’s test determines whether the most extreme value in your dataset (maximum or minimum) is significantly different from the rest, based on how many standard deviations it is away from the mean.
Grubbs’s test requires normally distributed data without trends or seasonality, and it must be applied iteratively because it detects only one outlier at a time. If you suspect that a single point in your dataset may be an anomaly, however, it can be a useful tool.
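A minimal invocation, assuming TDgpt exposes this test under the identifier grubbs:
-- Test whether the most extreme value in each window is a statistically significant outlier
SELECT _wstart, COUNT(*) FROM <table-name> ANOMALY_WINDOW(<column-name>, "algo=grubbs");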
LOF
Local Outlier Factor (LOF) is a density-based method for anomaly detection, useful for multivariate data where simple statistical techniques like k-sigma or IQR are less effective. LOF measures the local density deviation of a given data point with respect to its neighbors, considering a point with significantly lower density to be an anomaly.
LOF requires larger datasets to be effective and requires more compute resources than other algorithms, but works well with high-dimensional data and does not require any specific distribution.
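A minimal invocation, assuming TDgpt exposes this algorithm under the identifier lof:
-- Flag points whose local density is significantly lower than that of their neighbors
SELECT _wstart, COUNT(*) FROM <table-name> ANOMALY_WINDOW(<column-name>, "algo=lof");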
Forecasting Example
The following SQL statement uses Holt–Winters to forecast data from a data column:
SELECT _frowts, FORECAST(<column-name>, "algo=holtwinters") FROM <table-name>;
Anomaly Detection Example
The following SQL statement uses IQR to detect anomalies in a data column:
SELECT _wstart, COUNT(*) FROM <table-name> ANOMALY_WINDOW(<column-name>, "algo=iqr");
For more information about the usage of TDgpt and specific algorithms, see the official documentation.