HiThink RoyalFlush is a powerful financial information network that provides online financial data, data analysis software, and other financial software systems. RoyalFlush needs to handle huge volumes of stock market data from stock exchanges around the world. Given the requirement to store and process large volumes of data (sometimes including historical data) each day, the database management system (DBMS) plays a pivotal role in our portfolio management system (PMS). We initially employed Postgres+LevelDB as our storage solution, but as various bottlenecks emerged in real-world usage, an upgrade became necessary.
After researching the solutions provided by industry-leading time-series databases (TSDBs), we selected TDengine as our new storage solution. Since the upgrade, the stability, storage efficiency, and query performance of our data monitoring system have improved greatly.
Background & Bottlenecks
RoyalFlush Private Equity House is an intelligent PMS platform that offers diversified financial services (asset management, real-time stock market data monitoring, risk analysis, performance analysis, report generation, etc.) to securities brokers, investment funds, and private equity funds.
Our PMS platform implements stock market monitoring and analytical functions by processing both real-time and historical data.
In these cases, real-time data is mainly used for asset monitoring. Because hundreds of different underlying assets must be monitored in real time, with a data refresh frequency of around once per second, the whole system must deliver high-efficiency read/write performance and very low latency.
Additionally, historical data (at the daily level) is commonly used for portfolio management analysis. Historical data analysis involves a larger number of underlying assets than real-time asset monitoring, and the time scope can span up to 10 years. Moreover, various analytical indicators and models are integrated into the output analysis report. Huge data sets are involved throughout the analysis process, which places even higher demands on the read and write performance of the DBMS.
Historical data analysis covers multiple security types, including stocks, funds, the Hong Kong and US stock markets, futures, and options. The time scope may range from days to years, while each analytical report must be generated within 5 seconds, so multiple computations run at the same time. High concurrency in the report generation module is therefore an issue that can affect the performance of the PMS system and degrade the user experience.
Examining the market data processing pipeline revealed two bottlenecks in the previous solution:
- Excessive External Dependencies & Poor Reliability: PMS provides analytical services for multiple security types (with granularity ranging from minute level to daily level, as well as real-time analysis). Data acquisition across the whole system relies heavily on HTTP, Postgres, and LevelDB. Each additional data link undermines reliability, and the excessive external dependencies also complicate queries in practice.
- Insufficient Performance: PMS requires high-efficiency data acquisition to satisfy diversified analytical requirements, and the Postgres+LevelDB solution consumed too many computing resources.
Research & Solutions Comparison
Given these pain points, we urgently needed an upgraded solution. Our team therefore studied the dominant TSDBs and gained the following insights:
- ClickHouse: High operations and maintenance (O&M) costs, complicated scaling design, and high resource consumption
- InfluxDB: Specialized for time-series data, with well-founded use cases in data monitoring and IoT scenarios, but no open-source clustering support
- TDengine: High performance, flat learning curve, lower O&M costs, horizontal scalability, and high availability design
Given the results of the comparison, we decided to choose TDengine as a new storage solution for our PMS platform.
As all stock market data is timestamped, TDengine, an industrial-grade TSDB tuned for time-series data, fits our usage scenario very well. According to the official performance test report, TDengine is significantly faster than InfluxDB in terms of write speed. Moreover, TDengine provides connectors for C/C++, Java, Python, and Go, as well as a RESTful API.
It takes two steps to write data to TDengine:
- Performing data cleansing to remove incompatible data types
- Using a script to migrate historical data in .csv format, and a Python script to load incremental data
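The incremental-loading step can be sketched in Python as follows. The table name, column layout, and record tuples below are illustrative assumptions for the sketch, not our production schema:

```python
# Sketch of loading incremental market data into TDengine.
# Table and column layout (ts, price, volume) are illustrative,
# not the actual RoyalFlush schema.

def build_insert(table, rows):
    """Build one multi-row TDengine INSERT statement.

    rows: iterable of (timestamp_str, price, volume) tuples.
    TDengine accepts multiple value groups after a single VALUES keyword.
    """
    values = " ".join(
        f"('{ts}', {price}, {volume})" for ts, price, volume in rows
    )
    return f"INSERT INTO {table} VALUES {values};"

rows = [
    ("2021-06-01 09:30:00.000", 12.34, 500),
    ("2021-06-01 09:30:01.000", 12.35, 300),
]
sql = build_insert("tick_600000", rows)
# The statement could then be executed through the TDengine Python
# connector, e.g. conn.execute(sql) on an open taos connection.
print(sql)
```

Batching many rows into one statement reduces round trips, which matters at our write rate of tens of millions of records per day.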
Usage Scenario & Data Modeling
To achieve better performance, TDengine requires the schema to be designed before data is written.
Stock market data in the RoyalFlush PMS platform shows the following characteristics:
- Structured data with timestamps
- Data is rarely updated or deleted
- Few, relatively fixed tag columns
- Relatively small data volume and few fields per record
The characteristics of tick data:
- Large data volumes, over 20 million data records per day
- Persistent data retention (years of data records must be retained)
The characteristics of daily data:
- Over 200,000 subtables
- 200,000 data records written per day
- Persistent data retention (the most recent 30 years of data records must be retained)
Given the characteristics mentioned above, we constructed a data model as shown below.
Following the data modeling template of TDengine, we created separate databases for different types of data sets. We then configured different parameters to create STables for different product categories (stock, index, bond, fund, etc.):
- Creating STables based on the existing portfolio typology to facilitate further aggregations
- Defining the attributes of the underlying assets as the tag values of the STable
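This modeling approach can be sketched as TDengine SQL issued through the Python connector. All database, STable, column, and tag names below (market_daily, stock_daily, symbol, exchange) are assumptions for illustration, not our production schema:

```python
# Illustrative DDL for the data model described above. Names and types
# are assumptions for the sketch, not the actual RoyalFlush schema.

DDL_STATEMENTS = [
    # One database per data set type; KEEP controls retention in days
    # (10950 days is roughly the 30-year retention mentioned above).
    "CREATE DATABASE IF NOT EXISTS market_daily KEEP 10950;",
    # One STable per product category; tags hold the attributes of the
    # underlying asset so aggregations can group or filter by them.
    (
        "CREATE STABLE IF NOT EXISTS market_daily.stock_daily "
        "(ts TIMESTAMP, open DOUBLE, high DOUBLE, low DOUBLE, close DOUBLE, "
        "volume BIGINT) "
        "TAGS (symbol BINARY(16), exchange BINARY(8));"
    ),
    # Each underlying asset gets its own subtable created from the STable.
    (
        "CREATE TABLE IF NOT EXISTS market_daily.stock_600000 "
        "USING market_daily.stock_daily TAGS ('600000', 'SSE');"
    ),
]

# With the TDengine Python connector the statements would be executed as:
#   import taos
#   conn = taos.connect(host="...", user="...", password="...")
#   for stmt in DDL_STATEMENTS:
#       conn.execute(stmt)
for stmt in DDL_STATEMENTS:
    print(stmt)
```

Grouping assets under per-category STables keeps cross-asset aggregations simple, since a query against the STable transparently covers all of its subtables.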
Challenges in Field Application
PMS must acquire data in a stable and efficient manner, so high-efficiency queries, fast data updates, and simple O&M were necessities in the actual usage scenario.
Here are several difficulties we encountered in the field application:
- Data Writing: While migrating historical data in .csv format, we failed to execute automatic table creation in TDengine. We asked the TDengine development team for help and learned that importing a .csv file is not supported in combination with automatic table creation. We therefore decided to create the table schemas first and then import the .csv files.
- Querying Problem: Querying TDengine via HTTP means querying a specific server, which exposes the whole system to potential risks. First, sending all query requests to a single server might cause overloading; second, it cannot guarantee the high availability of query services. Thus, when building our TDengine cluster, we configured multiple HTTP interfaces to handle this problem.
- Storage Structure: Data types and data volumes have a significant impact on the performance of TDengine, so the storage structure should be tuned according to the data attributes of each usage scenario, e.g. the number of tables, data length, and number of copies (replica). To optimize performance, parameters like blocks, cache, and ratioOfQueryCores need to be carefully configured. Following the instructions of the TDengine developers, we defined a storage model that fits our usage scenario.
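Spreading query load across multiple HTTP interfaces can be sketched as a simple round-robin pool on the client side. The endpoint URLs below are placeholders, not our actual cluster addresses:

```python
# Sketch: distribute REST queries across several TDengine HTTP endpoints
# instead of sending every request to a single server.
# Endpoint URLs are placeholders.
import itertools


class RestEndpointPool:
    """Round-robin selection over a fixed list of REST endpoints."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        # Return the next endpoint in rotation.
        return next(self._cycle)


pool = RestEndpointPool([
    "http://tdengine-node1:6041/rest/sql",
    "http://tdengine-node2:6041/rest/sql",
    "http://tdengine-node3:6041/rest/sql",
])

# Each query goes to the next endpoint in turn, e.g. with `requests`:
#   requests.post(pool.next_endpoint(), auth=(user, password), data=sql)
for _ in range(4):
    print(pool.next_endpoint())
```

A production setup would typically also add health checks or sit an HTTP load balancer in front of the cluster; the rotation above only illustrates the idea of avoiding a single query server.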
High Efficiency Brought by TDengine
After data migration, TDengine has demonstrated the high and stable performance we expected.
- The comparison chart below shows the high efficiency brought by TDengine.
- Stability: over 400,000 queries, the error rate was 0.01% with Postgres+LevelDB but lower than 0.001% with TDengine.
Note: Connecting to TDengine via RESTful JDBC limits the maximum number of rows a query can return (default: 10,240 rows), which can be resolved by configuring the parameter "restfulRowLimit".
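For reference, the limit is raised on the server side in taos.cfg (the value below is an illustrative example, not a recommendation):

```
# taos.cfg
# maximum number of rows returned by one RESTful query
restfulRowLimit 1000000
```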
From the perspective of big data monitoring, TDengine demonstrates dominant competency in terms of O&M costs, writing/reading performance, and technical support.
The TDengine development team offered professional and efficient support throughout the research and deployment process, and we would like to express our sincere gratitude!
We will continue to grow with the TDengine community and contribute more to this project. We also hope future TDengine development will focus on improvements such as:
- More diversified SQL-like syntax
- Rolling upgrade support
- User-defined aggregations
- Faster data migration
In our future plans, we will use TDengine in the following scenarios:
- The migration of tick data & minute-level market data
- Our online apps
- The use of user-defined aggregations to deal with minute-level and day-level market data
- Real-time stock market data processing