Deploying an IT infrastructure monitoring system

Introduction

A stable system for real-time IT infrastructure monitoring is an absolute necessity for a robust and efficient application system. Monitoring refers to collecting key metrics of system performance or resource usage status at periodic intervals over time. One of the primary uses of this monitoring data is to provide real-time status reporting and alerting, based on which the system decides what actions need to be triggered to maintain a stable running status.

The information extracted from IT infrastructure monitoring data over the long term can also provide critical indicators that administrators or developers can use to refine the current system implementation or identify potential problems that might occur when the system is upgraded, new components are deployed, or usage increases.

However, depending on the number of devices and metrics being monitored, the volume of the monitoring data can be extremely large, and this volume will keep growing with the number of monitored devices and metrics. Traditional relational databases like MySQL may be a solution for IT infrastructure monitoring data at a smaller scale, but their performance deteriorates dramatically when the size of the data set grows to a certain level. At this point, such databases are generally no longer able to give real-time and on-demand responses. A Hadoop-based ecosystem can provide the necessary performance when horizontally scaled to a large number of nodes, but this will have a significant impact on the project budget.
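To get a feel for how quickly this volume grows, the arithmetic can be sketched in a few lines of Python. The fleet size, metric count, and collection interval used here are illustrative assumptions, not figures from a real deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_data_points(devices: int, metrics_per_device: int, interval_seconds: int) -> int:
    """Return how many data points are collected per day.

    Each device reports every metric once per collection interval.
    """
    samples_per_day = SECONDS_PER_DAY // interval_seconds
    return devices * metrics_per_device * samples_per_day


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 devices, 50 metrics each, sampled every 10 seconds.
    print(daily_data_points(1000, 50, 10))  # prints 432000000
```

Doubling either the device count or the metric count doubles the daily total, and halving the collection interval has the same effect.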

Industries may focus on monitoring different aspects of their systems, but the collected data have one thing in common – they are all time-series data and mostly analyzed in terms of time. For this reason, it is becoming more and more common for industries to use a purpose-built time-series database as the persistence layer of the acquired monitoring data. Time-series databases read and write time-series data much faster than general-purpose databases or storage systems, making time-series databases the best choice for persisting and analyzing structured sequential data like IT infrastructure monitoring data.

The following sections describe a simple way to deploy an IT infrastructure monitoring system that uses TDengine as the data processing component.

System Overview

The standard hardware metrics that are monitored in an IT scenario are CPU usage, memory usage, disk usage, disk I/O, and network availability – all of which are structured time-series data. For software, the monitoring metrics are often service availability, connections, requests, timeouts, errors, and response time.
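To make "structured time-series data" concrete, here is a minimal Python sketch of a single metric reading using only the standard library. The record layout and the names measurement, tags, fields, and timestamp_ns are assumptions chosen for illustration; in the deployment described in this article, Telegraf's input plugins take care of the actual collection.

```python
import shutil
import time


def sample_disk_usage(path: str = "/") -> dict:
    """Take one disk-usage reading and return it as a time-series record."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return {
        "measurement": "disk",            # what is being measured
        "tags": {"path": path},           # identifies the monitored object
        "fields": {"used_percent": 100.0 * usage.used / usage.total},
        "timestamp_ns": time.time_ns(),   # when the reading was taken
    }
```

Calling sample_disk_usage at a fixed interval yields a sequence of records that differ only in their timestamp and field values, which is the shape of data that a time-series database is designed to store.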

An IT infrastructure monitoring system typically consists of acquisition, storage, analytics, and visualization modules:

  • Data acquisition: The diversity, accuracy, consistency, and timeliness of the acquired data directly affect the eventual effectiveness of the monitoring system. In this article, industry leader Telegraf is used as the data acquisition component. Telegraf is a plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors.
  • Data storage: The optimal data storage engine for a use case depends on the characteristics of the data monitored. For structured time-series data, deploying a purpose-built time-series database offers the best performance and highest efficiency. In this article, open-source time-series database TDengine is used as the data storage component.
  • Data analytics: The purpose of collecting data is to gain insight from it, which is the role of analytics. Generally, time-series data analysis is performed either in real time or on historical data:
    • Real-time analytics emphasizes instantly computing and reporting the latest collected data. This could include retrieving the latest values of certain fields, alerting, generating real-time aggregation result curves, stream computing, and sliding windows.
    • Historical analytics focuses on analyzing data from the past over a much longer time span to discover trends, patterns, correlations, and other statistical relationships. The complex nature of this type of analytics poses higher requirements on the read and write performance of the data engine.
  • Data visualization: Once data has been collected and analyzed, it must then be visualized so that the trends, outliers, and patterns in data can be understood. In this article, Grafana is used as the data visualization tool. Grafana is an open-source visualization tool for time-series data featuring a customizable GUI that allows users to query, visualize, and receive alerts on their metrics.
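As an illustration of the sliding-window computation mentioned under real-time analytics, the following Python sketch keeps a running average over the most recent samples. The class name and the 30-second window are hypothetical; this shows the technique in isolation and is not how TDengine implements its windowed queries.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the samples whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Record a sample and return the current windowed average.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._samples.append((timestamp, value))
        self._total += value
        # Drop samples that have fallen out of the window. The sample just
        # added always stays, so the deque is never empty here.
        while self._samples[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._samples.popleft()
            self._total -= old_value
        return self._total / len(self._samples)
```

A real-time alert can then be as simple as comparing the value returned by add against a threshold, for example raising a warning when the 30-second average CPU usage exceeds 90 percent.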

System Deployment

TDengine can integrate quickly and seamlessly with Telegraf and Grafana without any coding. The data flow is shown in the figure below.

Figure 1. Data processing flow for IT infrastructure monitoring

Here are the steps to set up a CPU usage monitoring system using the above tools.

Installing and Configuring TDengine

To begin, install TDengine by following the steps in Get Started. Once the installation is complete, check that the taosd and taosadapter services are both running:

sudo systemctl status taosd
sudo systemctl status taosadapter

Then open the TDengine console by running the taos command and create a user to own the data collected by Telegraf. In this example, the user name is monitoringuser, and password is a placeholder that you should replace with a password of your own:

CREATE USER monitoringuser PASS 'password';

Installing and Configuring Telegraf

Install Telegraf according to the official documentation. Once the installation has completed, modify the Telegraf configuration file and add an output plug-in to send data to TDengine.

A typical configuration for TDengine is as follows:

[[outputs.http]]
url = "http://localhost:6041/influxdb/v1/write?db=monitoringdb"
method = "POST"
timeout = "5s"
username = "monitoringuser"
password = "password"
data_format = "influx"
influx_max_line_bytes = 250

As this example is monitoring a single machine, localhost is used for the TDengine cluster location. The database monitoringdb will be created in your TDengine cluster when Telegraf begins sending data to TDengine.
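Because data_format is set to "influx", Telegraf serializes each metric as a line of InfluxDB line protocol before posting it to the URL above. The Python sketch below shows roughly what one such line looks like. It is a simplified serializer written for illustration only, the function names are hypothetical, and the measurement, tag, and field values in the usage note are made up.

```python
TAG_SPECIALS = ", ="          # characters escaped in tag keys, tag values, field keys
MEASUREMENT_SPECIALS = ", "   # characters escaped in measurement names


def _escape(text: str, specials: str) -> str:
    """Backslash-escape every character of `specials` that appears in `text`."""
    for ch in specials:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    """Render a field value using line protocol's type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value,... field=value,... timestamp."""
    tag_part = "".join(
        f",{_escape(key, TAG_SPECIALS)}={_escape(str(value), TAG_SPECIALS)}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key, TAG_SPECIALS)}={_format_field(value)}"
        for key, value in fields.items()
    )
    return f"{_escape(measurement, MEASUREMENT_SPECIALS)}{tag_part} {field_part} {timestamp_ns}"
```

For example, to_line_protocol("cpu", {"host": "server01"}, {"usage_idle": 97.5}, 1700000000000000000) returns the line cpu,host=server01 usage_idle=97.5 1700000000000000000.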

Installing and Configuring Grafana

After you restart Telegraf so that it loads the new configuration, it collects IT monitoring metrics from your local machine and writes them into TDengine. You can confirm this by opening the TDengine console, switching to the monitoringdb database, and running a command such as SHOW TABLES; to list the tables created from the Telegraf data. The collected data now needs to be visualized.
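The same check can be scripted instead of typed into the console. The Python sketch below assumes that taosAdapter exposes a REST SQL endpoint at /rest/sql/<database> on port 6041 with HTTP basic authentication. Treat the endpoint path and the response format as assumptions to verify against the documentation for your TDengine version. The function names are hypothetical, and the credentials are the placeholders used earlier in this article.

```python
import base64
import json
import urllib.request


def build_sql_request(sql: str, host: str = "localhost", port: int = 6041,
                      database: str = "monitoringdb",
                      user: str = "monitoringuser",
                      password: str = "password") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL statement."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        f"http://{host}:{port}/rest/sql/{database}",  # assumed endpoint path
        data=sql.encode("utf-8"),
        method="POST",
    )
    request.add_header("Authorization", f"Basic {token}")
    return request


def run_sql(sql: str, **connection) -> dict:
    """Send the statement and return the decoded JSON response.

    Requires a running TDengine instance with taosAdapter enabled.
    """
    with urllib.request.urlopen(build_sql_request(sql, **connection), timeout=5) as response:
        return json.load(response)
```

With the services running, print(run_sql("SHOW TABLES")) displays the server's JSON reply, which should list the tables in monitoringdb once Telegraf has written data.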

To do so, install Grafana according to the official documentation. Once the installation has completed, open Grafana and log in. Then install the official plug-in for TDengine and add your TDengine deployment as a data source. Now that data from TDengine can be used in your Grafana deployment, add the Telegraf System Dashboard to visualize it.

Conclusion

Monitoring systems contain data acquisition, persistence, analytics, and visualization components. When monitoring large data sets, the bottleneck in terms of system performance is usually in the persistence layer. By deploying a purpose-built time-series database like TDengine as your data storage platform and then integrating with data acquisition and visualization components like Telegraf and Grafana, you can quickly create an IT infrastructure monitoring system that provides the information you need.