TravelGo reduces hardware costs by 50%

Introduction

TravelGo has a basic, in-house monitoring system called “Nightingale Monitoring”. Nightingale monitors millions of endpoints and 100 million metrics, handling 2 million concurrent writes per second and 20,000 concurrent queries. Its storage component is based on RRD. Although RRD performs well, its high resource consumption led TravelGo to look for ways to reduce hardware costs.

In addition, some data is lost if a machine restarts, because the memory cache is only flushed to RRD periodically. The underlying cause is that RRD writing is single-point: it cannot fail over automatically after a machine failure. This makes it impossible to display real-time or recent data over longer periods. In response, Nightingale incorporated many high-availability elements, but it was still difficult to meet business requirements. To alleviate this, the following changes were made:

  • Two sets of ElasticSearch storage were introduced to provide raw-data queries for data less than 7 days old.
  • API calls against the RRD database are supported at volumes in the tens of thousands of TPS.
  • The Nightingale panel can access ElasticSearch, where the original data for the past 7 days is stored, and supports several hundred QPS.

However, as the number of monitored parameters increased, the resource consumption of both storage systems increased as well. Business requirements for aggregate computation in monitoring were also growing, so we needed a new time-series platform that would reduce operational and maintenance costs without sacrificing scalability.

The most relevant high-level requirements were as follows:

  • A high performance platform capable of tens of millions of concurrent writes and 100,000 concurrent reads.
  • High availability with horizontal scalability and no single point of failure.
  • Built-in functions, including the four arithmetic operations, maximum, minimum, average, and latest-value, as well as aggregate calculation functions.

After comparing InfluxDB, TDengine, Prometheus, Druid, ClickHouse and many other popular database products on the market, TDengine was the one that met all of our selection requirements.

Schema design and development

The Nightingale monitoring system holds data for both system parameters and business parameters. The former, such as CPU, memory, disk, and network, are predictable indicators: the indicator names are fixed, and the total number is about 20 million. Business parameters are defined by each business through an API and monitored by Nightingale. Their names may change over time, and while the concurrency is not very high, the number of records can reach 100 million a year.

With TDengine, the table schema must be designed carefully to get maximum performance. Given the requirements above, one might be tempted to flatten the structure and create a large number of parameters at one time, but this would degrade performance. After communicating with TDengine technical support staff, we were given two schema options:

Option 1 – Aggregate the basic system parameters into a supertable. Under the supertable, each sub-table stores the data for one node, and multiple parameters are written in a single row. The advantage of this method is that the number of tables can be reduced to tens of millions. However, because Nightingale uploads each data point individually, it is difficult to collect all the indicators for a row before writing. Different parameters also have different upload frequencies; if separate supertables were built for each frequency, the operation and maintenance cost would be very high.

Option 2 – Build a separate sub-table for each parameter, aggregate about 50 million indicators into one cluster, and divide the workload across multiple clusters. The advantage of this method is the simple table structure, but operating and managing multiple clusters is troublesome. However, TDengine 3.0, to be released in 2022, will support more than 100 million tables, so this solution will work for us.

We chose Option 2, and at the same time, to reduce the number of clusters, we wrote additional software to clean up expired sub-tables on a regular basis. At present, the structure of the supertable monitored by Nightingale is as follows.

Supertable structure
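As a rough sketch of Option 2, the snippet below creates one sub-table per (endpoint, metric) pair under a single supertable, using TDengine's REST endpoint (`/rest/sql` on port 6041 with Basic auth). The database name `nightingale`, the supertable `metrics`, and its columns and tags are illustrative assumptions, not the actual Nightingale schema.

```python
import base64
import urllib.request

# Illustrative Option 2 schema (assumed names): one supertable for all
# metrics, one sub-table per (endpoint, metric) pair, tagged for filtering.
CREATE_STABLE = (
    "CREATE STABLE IF NOT EXISTS nightingale.metrics "
    "(ts TIMESTAMP, val DOUBLE) "
    "TAGS (endpoint BINARY(128), metric BINARY(128))"
)

def create_subtable_sql(endpoint: str, metric: str) -> str:
    """Build the CREATE TABLE statement for one metric sub-table."""
    # Derive a table name from endpoint + metric; a real system would
    # hash or sanitize this to stay within identifier length limits.
    tname = f"m_{endpoint}_{metric}".replace(".", "_").replace("-", "_")
    return (
        f"CREATE TABLE IF NOT EXISTS nightingale.{tname} "
        f"USING nightingale.metrics TAGS ('{endpoint}', '{metric}')"
    )

def exec_sql(sql: str, host: str = "localhost") -> bytes:
    """POST one SQL statement to TDengine's REST API (default credentials)."""
    token = base64.b64encode(b"root:taosdata").decode()
    req = urllib.request.Request(
        f"http://{host}:6041/rest/sql",
        data=sql.encode(),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```

The expired-sub-table cleanup mentioned above would then be a periodic job issuing `DROP TABLE` statements through the same `exec_sql` helper.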

After Nightingale monitoring is connected to TDengine, the architecture diagram is as follows.

Deployment

During data migration, we first transferred the Nightingale monitoring data to Kafka, and then converted the Kafka data into SQL for ingestion into TDengine. In this process we encountered three issues, described below along with their solutions.

  • Connection method problem – Initially we used the Go connector to access TDengine. The Go connector depends on the taos client package and the network configuration in the taos.cfg configuration file. The FQDN setting in particular makes VIP (Virtual IP) load balancing difficult. Since we were planning to deploy in containers, we switched to the RESTful API.
  • Kafka consumption quantity problem – Data uploaded to Kafka arrives with a concurrency of about 2 million, so the number of connections is quickly exhausted. Through communication with TDengine technical support staff, we learned that we could batch rows into a single SQL statement. We get the best performance when a single SQL statement is about 400-600 KB long. By our calculation, about 5,000 of our parameters fit in one statement, at a size of about 500 KB.
  • Read speed problem – A write is triggered only after 5,000 pieces of data have been consumed, which in turn slows down reading. TDengine technical support staff once again provided a solution: use TDengine’s own queue.
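The batching described above can be sketched as follows. This is a minimal illustration of packing many rows into one multi-table INSERT and flushing when the statement nears the ~500 KB sweet spot; the sub-table names are placeholders, and flushed statements are kept in a list here, whereas real code would POST each one to TDengine.

```python
FLUSH_BYTES = 500 * 1024  # ~500 KB per statement, per the tuning above

class BatchWriter:
    """Accumulate rows into one multi-table INSERT, flushing near a size cap."""

    def __init__(self, flush_bytes: int = FLUSH_BYTES):
        self.flush_bytes = flush_bytes
        self.parts = ["INSERT INTO"]
        self.size = len(self.parts[0])
        self.flushed = []  # completed statements (a real writer would POST them)

    def add(self, table: str, ts_ms: int, value: float) -> None:
        # TDengine multi-table insert form:
        #   INSERT INTO t1 VALUES (...) t2 VALUES (...) ...
        part = f" {table} VALUES ({ts_ms}, {value})"
        if self.size + len(part) > self.flush_bytes and len(self.parts) > 1:
            self.flush()
        self.parts.append(part)
        self.size += len(part)

    def flush(self) -> None:
        """Finalize the pending statement, if it holds any rows."""
        if len(self.parts) > 1:
            self.flushed.append("".join(self.parts))
        self.parts = ["INSERT INTO"]
        self.size = len(self.parts[0])
```

A consumer loop would call `add()` for every record pulled from Kafka and `flush()` on shutdown, so each connection carries a few large statements instead of millions of tiny ones.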

In practice, TDengine’s data writing performance is extremely high. Originally, our single storage system required more than 10 high-end machines, with average disk IO of 30% and peaks of 100%; now we only need 7 machines, with CPU consumption around 10% and disk IO consumption around 1%.

Resource consumption after deploying TDengine
Resource consumption before deploying TDengine

At the same time, queries have extremely low latency. After switching to the RESTful interface, combined with the powerful aggregation functions built into TDengine, it is easy to perform complex queries and get results almost instantaneously.
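As an illustration of the kind of aggregation that becomes a one-liner, the sketch below builds a downsampling query with TDengine's built-in AVG() and INTERVAL() windowing. The `nightingale.metrics` supertable, its `endpoint` tag, and the metric name are assumptions, and the `INTERVAL ... GROUP BY` form follows TDengine 2.x syntax (3.0 uses `PARTITION BY`).

```python
def avg_per_endpoint_sql(metric: str, hours: int = 24, bucket: str = "5m") -> str:
    """Build a query averaging one metric per endpoint in fixed time buckets.

    Relies on TDengine's built-in AVG() and INTERVAL() windowing; the
    schema names here are illustrative, not Nightingale's actual schema.
    """
    return (
        "SELECT AVG(val) FROM nightingale.metrics "
        f"WHERE metric = '{metric}' AND ts > NOW() - {hours}h "
        f"INTERVAL({bucket}) GROUP BY endpoint"
    )
```

Such a statement can be sent through the same RESTful interface as the writes, so no client-side aggregation code is needed.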

Conclusion

TDengine has demonstrated superior performance in many aspects. Its write performance and compression rate are extremely high, and it provides a comprehensive list of aggregation functions, including Sum/Max/Min/Avg/Top as well as Last/First/Diff/Percentile. TDengine also offers a Grafana-based, zero-dependency monitoring solution called TDinsight, which can be used to monitor TDengine itself.

In the future, we hope to cooperate more deeply with TDengine. We offer a few small suggestions:

  • The official documentation is incomplete. Functions in new versions are not always reflected in the documentation, and there is a dearth of code examples. This makes it difficult for those who are new to time-series databases (TSDB).
  • User experiences should be documented in a format and location that search engines can easily index. It is currently difficult to find community solutions for common issues through search.

I would like to thank TDengine’s technical support staff for their full support. Although TDengine is still in the early stages of development and there are still some optimizations to be made, its excellent performance was a pleasant surprise. All in all, TDengine is a very good time-series platform and we believe it will develop better and better under the leadership of Jeff Tao!