Enabling Millisecond-Level Response Time for a Massive Time-Series Data Set

Lynx (POIZON)
Lynx (POIZON)
/
Share on LinkedIn

POIZON, is one of the largest sneaker trading platforms, where users can buy, sell and verify the authenticity of limited edition sneakers and accessories. It offers rich social features that allow the community to share comments and photos of their favorite outfits directly in the app. Launched in 2015, POIZON entered the ranks of unicorn companies in April 2019.

For a leading e-commerce company, data monitoring is a pivotal module in many systems and scenarios. Currently, POIZON has to handle hundreds of millions of data records per day and deals with over 10 thousand transactions per second.

To tackle such a challenge, we firstly employed Sentinel, an open-source flow control component, to satisfy our requirements for flow control & safeguarding. However, things were not always easy, we found that the open-source edition of Sentinel cannot support data persistence during the development of Sentinel. Data persistence plays a crucial part in our design, hence we have to look for a new solution — a powerful database that shows industry-leading high performance in data ingestion, data query, and big data storage.

In the existing production environment, hundreds of systems and thousands of servers are connected to Sentinel, which generates enormously large amounts of data. Therefore, high reliability, high availability, and horizontal scalability are our key requirements for the new solution.

Why POIZON Chose TDengine

As we’ve referred above, hundreds of millions of data records will be generated in the existing production environment per day, which also means that the whole system needs to process thousands of transactions per second. Considering the consistently growing business scale, the data volume will undoubtedly explode. Regarding such a large and rapidly-growing data volume, traditional relational database management systems (RDBMS) cannot fit in our usage scenarios and business systems. Thus, we focus on looking for solutions provided by time-series databases.

According to our research, we listed the pros and cons of several prevailing time-series databases:

  • InfluxDB: InfluxDB is one of the most leading time-series databases and is widely used in similar scenarios with POIZON. However, the open-source edition of InfluxDB doesn’t support clustering.
  • OpenTSDB: OpenTSDB is a time-series database written on top of HBase, hence it heavily relies on HBase to implement data storage. In this case, the deployment of OpenTSDB would be too complex and cost more overhead.
  • Apache Cassandra: Apache Cassandra is an open-source NoSQL distributed database, while it cannot meet our requirements according to the performance testing reports we have collected.

When we were disappointed by the unfavorable research outcomes, we surprisingly encountered TDengine, an open-source time-series database capable of handling high write and query loads with industry-leading high storage efficiency.

We were so excited and then developed a demo to test the performance of TDengine. The testing results are as expected — TDengine presented excellent performance during the testing. According to the research, TDengine demonstrated competitive features such as zero learning curve, lower operations & maintenance costs, horizontally scalable, etc. in the case studies. Additionally, TDengine has a mature and active community that provides timely and professional technical support.

Through a comprehensive decision-making process, we formally decided to use TDengine as our new data monitoring solution.

How to Plan Data Architecture & Data Modeling

Data Architecture

In Sentinel, metric data is displayed on the dashboard for different applications, metrics including Timestamp, Passed QPS, Blocked QPS, and Response Time (RT). Therefore, the unique key of data is “application-resource” from the perspective of the front-end dashboard.

What does data architecture look like from the perspective of internal implementation? Sentinel Client will monitor the traffics of all resources on every single server and then integrate & write the metric data into the local log files. At the same time, Sentinel Dashboard will acquire the metric data via the connector of Sentinel Client and then integrate & write the traffic data from all single machines into the memory. Thus, the unique attribute of all data stored in the database is “application-resource”.

Data Modeling

The Documentation highlights the following features of data modeling in TDengine:

“To utilize this time-series and other data features, TDengine requires the user to create a table for each collection point to store collected time-series data. 

“The method of one table for each data collection point can ensure the optimal performance of insertion and query of a single data collection point to the greatest extent.

“In the design of TDengine, a table is used to represent a specific data collection point, and STable is used to represent a set of data collection points of the same type. When creating a table for a specific data collection point, the user uses the definition of STable as a template and specifies the tag value of the specific collection point (table). Compared with the traditional relational database, the table (a data collection point) has static tags, and these tags can be added, deleted, and modified afterward. 

“A STable contains multiple tables with the same time-series data schema but different tag values.”

As presented above, the suggested data modeling completely fits in the characteristics of data in our usage scenario — an “application-resource” (data point) is a table, all “application-resource” (the same type of data points) can be put into a STable to facilitate aggregations. Therefore, we adopted the data modeling method in the documentation. Meanwhile, we also set up some information of applications as tags in tables for further aggregations.

Overall Architecture

In light of the data processing pipeline shown above, all business systems that have connected to Sentinel will periodically send heartbeat requests to Sentinel Dashboard for ensuring the smooth and stable running status of the machine. Meanwhile, the dashboard will perform polling on all machines and pull & integrate metrics into the TDengine cluster through batch writing.

How POIZON Deployed TDengine

Connector

The main development language in our company is Java, so JDBC is our first choice amongst various connectors supported by TDengine. Besides, JDBC not only outperforms HTTP in performance but also supports automatically switching to alternative nodes if a node cannot work. Although JDBC-JNI has to rely on functions in the native library (which increased the complexity of deployment as we had to install TDengine on client machines), it is still the best option.

Recently, we’ve noticed that TDengine R&D team updated the JDBC-RESTful connector to support cross-platform implementation. However, regarding Linux as the sole OS on our servers, we decided to keep with JNI. The graph below is a comparison between JDBC-JNI and JDBC-RESTful.

Connection Pool & ORM

As for the connection pool and Object-Relational Mapping (ORM), we decided to employ Druid+MyBatis, as taosdemo (TDengine testing scripts) can smoothly and efficiently connect to Druid+Mybatis. We only used MyBatis in queries (transform ResultSet into processable entities) while concatenating SQL statements for data ingestion.

Generally, TDengine presents powerful compatibility across popular frameworks, supporting HikariCP, Druid, Spring JDBC Template, MyBatis, etc. Meanwhile, taosdemo can greatly reduce testing costs. And notes & tips about connecting to taosdemo are listed in the documentation.

Clustering TDengine

Currently, there are three physical nodes (machines) in the TDengine Cluster with the settings of 16-core CPU+64GB Memory+1TB Disk. The procedures and instructions of TDengine Cluster Management are detailed in the documentation, hence it is as easy as pie to build up the clustering.

Create Database

In light of the research on TDengine, if there are three machines in the cluster and three replications for each data record (which means there is a replication of total data in every single machine), it will slow the storing & cashing efficiency of TDengine, so we decided to configure the parameter “replica” as 1 when creating a database. If scaling out the cluster is needed in the future, TDengine also supports dynamic alteration on the number of “replica”, which demonstrates the high availability of TDengine. Besides, we configured the parameter “blocks” as 16 and the parameter “cache” as 64MB for faster query speed.

CREATE DATABASE sentinel KEEP 365 DAYS 1 blocks 16 cache 64;

How TDengine Performs in POIZON

At present, TDengine runs stably to handle tens of billions of data records in the production environment. In the daily usage scenario, the CPU usage of TDengine is lower than 1% and the memory usage is lower than 25%.

The following graph shows the monitoring data of a single machine:

Writing Performance

Sentinel Dashboard is installed on a 4-core/16GB machine, here are the settings:

  • Maximum threads in the thread pool: 16
  • Maximum threads in the database connection pool: 20
  • Actual threads in use: 14

The maximum number of data records per insert is 400, the writing time for per insert is shown below:

According to the picture above, a multi-thread batch writing only takes 10ms on average, which is an impressive speed. If we optimize the configuration of the parameter “maxSQLLength”, the writing performance will be further improved.

Query Speed

As for the query speed, the following data is the response time of SQL queries from several typical usage scenarios (Each Stable contains tens of billions of data records):

  • last_row function: 8.6 ms, 8.8 ms, 5.6 ms
select last_row(*) from stable;
  • Query the total data of a sole application-resource within randomly 5 minutes: 3.4 ms, 3.3 ms, 3.3 ms
select * from table where ts >= '2021-01-01 19:00:00' and ts < '2021-01-01 19:05:00';
  • Query the average Passed QPS of a sole application-resource by a 2-min interval within 3 hours: 1.4 ms, 1.3 ms, 1.4 ms
select avg(pass_qps) from table where ts >= '2021-01-01 19:00:00' and ts < '2021-01-01 22:00:00' interval (2m);
  • Query the average Passed QPS by a 2-min interval within one day in a “group by” manner: 2.34s, 2.34 s, 2.35s
select avg(pass_qps) from stable where ts >= '2021-01-01 00:00:00' and ts < '2021-01-02 00:00:00' interval (2m) group by appid;
  • Query the average Passed QPS by a one-hour interval within three days: 2.17 s, 2.16 s, 2.17 s
select avg(pass_qps) from stable where ts >= '2021-01-01 00:00:00' and ts < '2021-01-03 00:00:00' interval (60m) group by appid;

Regarding the data presented above, TDengine manifests excellent query speeds in aggregated queries based on large amounts of data and queries on the total data within a small time interval. Moreover, the query performance has been further optimized in the newly released version of TDengine.

Storage Costs

In the new architecture, there are no data replications in Sentinel, the total data is stored in the three machines of TDengine Cluster. According to the computation, TDengine impressively increases the 10% compression rate of the metric data in Sentinel.

Conclusion & Suggestions

TDengine currently performs as a highly efficient time-series database (TSDB) in POIZON, while the advanced features of TDengine such as stream computing and built-in query functions are not needed in the existing usage scenarios. However, as an open-source time-series database, TDengine outperforms comparable peers in terms of writing & query performance.

Additionally, an astonishingly flat learning curve, amazingly low operations & maintenance costs, and surprising easiness to build clustering are three competitive features of TDengine. Also, The TDengine R&D team presents quick responses to technical issues. And the ongoing improvements in performance and stability with rapid iterations further consolidate our confidence in TDengine.

The detailed documentation of TDengine has also impressed us. We can not only find technical instructions but also learn practical knowledge (architecture design, technical design, cluster management methods, elaborated technical details, etc.) from use cases and blogs, which is really helpful for us to acquire insights into TDengine.

We will keep informed of the release notes of TDengine and focus on exploring the usage scenarios suitable for TDengine in POIZON. In the end, we are looking forward that TDengine could continuously enhance its performance and stability as well as provide more flexible solutions for different usage scenarios.

(All data presented in this case is based on two versions of TDengine: 2.0.7.0 & 2.0.12.1)