Five Years of Perseverance: My Challenging but Rewarding Journey with TDengine

Jeff Tao

Five years ago, I made a pivotal decision: to open-source the core components of TDengine, a time-series database designed for Industrial IoT that our team had been developing for over two years. To my surprise, this move quickly gained significant attention from developers, and TDengine topped GitHub’s global trends multiple times. Today, TDengine boasts over 23,000 stars, 4,800 forks, and 590,000 installations across more than 60 countries worldwide.

These numbers are particularly exciting for me, a developer in his 50s who’s still coding at the forefront. A large user base is the greatest reward for a programmer’s work, because it proves that our code and our effort have provided real value to the world. On this fifth anniversary of TDengine becoming open-source software, I want to reflect on the journey TDengine has taken since then and share my experiences with fellow developers and entrepreneurs.

Focusing on the Time-Series Data Niche Market

In March 2016, a landmark event occurred in the tech world when Google’s AlphaGo defeated world-class Go player Lee Sedol 4-1. This victory by AlphaGo ignited a global surge of interest in artificial intelligence. Among the major applications of AI is autonomous driving, which depends on real-time computations and decisions made based on data collected by vehicles. This data, timestamped and gathered at high frequencies, has led to an exponential increase in data volume.

By 2016, various modes of transportation—such as bicycles, cars, and trucks—had become connected or were gearing up to be, and ride-sharing was gaining popularity. These connected vehicles continuously collect data, primarily in the form of time-series data. As the transportation industry enters the era of mobile internet and AI, it is experiencing exponential growth in data volume.

Similarly, the rise of clean energy technologies such as solar, wind, and energy storage—driven by technological innovation and government support—has led to a significant increase in power grid equipment. Clean energy sources often produce unpredictable power output, posing substantial challenges for grid management. Addressing these challenges requires real-time data collection, computation, and decision-making at every stage, from power generation and transmission to distribution and consumption—once again involving time-series data. Additionally, the ability to sell excess energy back to the grid has created a real-time electricity trading system, transforming the power grid into a distributed energy network that demands real-time operational support.

In 2016, after leaving my previous startup, I had the opportunity to analyze these industry shifts. I recognized that both the transportation sector and distributed energy systems would generate vast amounts of time-series data. General-purpose databases and big data platforms were not equipped to efficiently handle the scale of this data, necessitating specialized time-series data processing tools. Beginning in September 2016, I started researching time-series data processing. I soon encountered time-series databases like InfluxDB, OpenTSDB, and Prometheus, but found that they had limitations in processing efficiency, horizontal scalability, and usability. Based on my previous startup experience and intuition, I saw a significant opportunity in this niche market. Thus, by October 2016, I was fully committed to researching time-series databases and wrote the first line of TDengine’s code on December 17, embarking on my third entrepreneurial journey.

Technical Innovation is the Core of the Product

In the competitive market of time-series databases, technical innovation is essential for any product to stand out. Analyzing use cases in the power and automotive industries revealed distinct characteristics of time-series data. For instance, data from sensors or devices is structured and continuous, resembling a data stream, and is rarely updated or deleted. Users are more interested in trends over time than in specific values at individual points. By exploiting these characteristics, we developed a highly efficient time-series data processing engine.

Given that each sensor or device generates a data stream, the optimal modeling approach is to use one table per data collection point. For example, with ten million smart meters, you would need ten million tables. This approach reduces data writing to simple append operations and lends itself to column-based storage: because data from the same sensor changes slowly, it achieves a high compression ratio. To address the challenge of managing a vast number of tables, I introduced the concept of a “supertable.” Each device category gets a single supertable, and the table for each individual device is created from that supertable as a template, with its own set of tags. Tags and time-series data are stored separately, which keeps the large number of tables manageable.
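As a rough illustration of the model described above, the sketch below builds the SQL for one supertable and one table per device. The statement shapes follow TDengine's `CREATE STABLE ... TAGS` and `CREATE TABLE ... USING ... TAGS` syntax; the `meters` schema, the tag names, and the helper functions are hypothetical and chosen only for this example.

```python
# Hypothetical smart-meter schema, used only for illustration.
SUPERTABLE = "meters"


def create_supertable_sql() -> str:
    """One supertable per device category: shared columns plus tag definitions."""
    return (
        f"CREATE STABLE IF NOT EXISTS {SUPERTABLE} "
        "(ts TIMESTAMP, current FLOAT, voltage INT) "
        "TAGS (location BINARY(64), group_id INT)"
    )


def create_device_table_sql(device_id: str, location: str, group_id: int) -> str:
    """One table per data collection point, created from the supertable template."""
    # String formatting keeps the sketch short; real code should validate
    # identifiers and use the connector's parameter binding instead.
    return (
        f"CREATE TABLE IF NOT EXISTS {device_id} USING {SUPERTABLE} "
        f"TAGS ('{location}', {group_id})"
    )


def insert_sql(device_id: str, ts_ms: int, current: float, voltage: int) -> str:
    """Writes are plain appends to the device's own table."""
    return f"INSERT INTO {device_id} VALUES ({ts_ms}, {current}, {voltage})"


if __name__ == "__main__":
    print(create_supertable_sql())
    print(create_device_table_sql("d1001", "California.SanFrancisco", 2))
    print(insert_sql("d1001", 1700000000000, 10.3, 219))
```

With this layout, a query across every device in a category can target the supertable and filter by tag, for example `SELECT AVG(voltage) FROM meters WHERE group_id = 2`.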

With this unique “one table per data collection point” data model and the innovative supertable concept, TDengine outperforms popular databases like InfluxDB and TimescaleDB in data ingestion, querying, and compression. According to the globally recognized TSBS benchmark, TDengine excels in both CPU-only and IoT scenarios (see detailed test reports on the TDengine website).

However, superior performance alone is not enough; innovation in product functionality is also crucial. After analyzing time-series data applications, I determined that features such as caching, data subscription, and stream processing should be integrated into the database to create a comprehensive time-series data processing platform. This integration simplifies system architecture and reduces operational costs. Consequently, we named our product TDengine—Time-Series Data Engine—highlighting its role as an all-encompassing time-series data engine. By leveraging the specific characteristics of time-series data, our caching, data subscription, and stream processing functions outperform general-purpose tools like Redis, Kafka, and Spark, delivering higher performance and lower resource consumption, further reducing operational costs.

Ease of use is also paramount. From the moment I wrote the first line of code, I decided to use SQL as the standard query language, unlike InfluxDB, Prometheus, and OpenTSDB, which use custom query languages. We also prioritized a seamless installation and startup process, ensuring setup in under 60 seconds. All example code is designed to be copy-pasted and operational, minimizing learning costs.

Open-Sourcing the Core Code

How do you promote such an innovative product? For fundamental building-block software like databases, switching costs are high, and convincing developers to switch without a compelling reason is challenging. Thus, we decided to open-source TDengine. Although we had no open-source experience, once we had released our first official version and secured three major customers, we dedicated ourselves to preparing for the open-source release, starting in March 2019.

On July 12, 2019, at the ArchSummit in Shenzhen, I officially announced the open-sourcing of TDengine’s single-node version. The product’s positioning met the stringent demands of IoT and industrial internet data platforms, and its impressive performance and user experience led to a surge in GitHub stars and forks. The project even reached the top of the GitHub global trending charts multiple times. Website traffic skyrocketed as well. Within three months of open-sourcing, GitHub stars surpassed 10,000, far exceeding our expectations. Our small team of six had sparked significant market interest.

When I decided to open-source TDengine, I believed that the only way to truly capture the hearts of developers was to open-source the most crucial part of our code. I wanted to provide real value by fully showcasing our technical innovations and expertise. However, out of concern that open-sourcing might not succeed, we initially withheld one key feature: the clustering functionality. After witnessing the tremendous success of the single-node version and receiving significant user feedback requesting clustering capabilities, we decided to go ahead and open-source the cluster version. Following extensive preparations, we released the cluster version in August 2020. This decision proved correct, with the cluster version receiving even more enthusiastic support from the developer community. GitHub stars continued to rise, with daily installations exceeding 200 and code clones surpassing 1,000.

Recognizing the future of cloud-native solutions, we developed a cloud-native version, which we open-sourced in August 2021. This version was again well received by developers. Today, TDengine has over 23k stars and 4.6k forks on GitHub, with daily installations exceeding 500 and a total of 590,000 installations globally. The growth trend suggests that TDengine will soon become the de facto standard in time-series databases.

I am immensely proud that a five-year-old open-source project has achieved so many installations and GitHub stars. It shows that our relentless development has delivered value. A large user base is the greatest reward for a programmer. TDengine continues to evolve, and we plan to open-source more modules. Our open-source principle remains: to release the most valued and core functionalities.

TDengine's Open Source Journey

Commercial Success is Key to the Continuous Success of Open-Source

For a business to survive, profitability is essential. We cannot rely solely on the developers’ passion without economic returns. Therefore, alongside our open-source success, we have explored paths to commercial success. After some research, we decided to offer TDengine Enterprise, a paid enterprise edition, following standard practices for open-source software.

While TDengine’s core code, including the cluster and cloud-native features, is open-source, TDengine Enterprise includes additional features. These include data backup, disaster recovery, access control, security, multi-level storage, and seamless integration with various data sources. Even without these features, TDengine is fully functional and superior to other open-source time-series databases in both functionality and performance. However, these auxiliary functions are crucial for enterprise operations.

TDengine is widely used in IoT and industrial IoT scenarios, with various data sources such as MQTT, OPC-UA, OPC-DA, and traditional data historians like PI System and Wonderware. Our enterprise edition includes a component that allows for seamless data integration from these sources into TDengine—without writing even a single line of code. Since each data source may have different naming conventions, measurement units, and time zones, TDengine Enterprise includes capabilities for data transformation, filtering, and cleansing to ensure the quality of the data being stored. This dramatically simplifies the complexity of system deployment.

In enterprise applications, database backup, disaster recovery, and real-time synchronization are critical for data security and operational readiness. Without these features, data security cannot be guaranteed, and enterprises would be hesitant to put the database into production. Therefore, TDengine’s enterprise edition provides these features. Additionally, with the rise of edge computing, our enterprise edition offers edge-cloud synchronization, allowing data from edge devices to be synchronized easily with private or public clouds.

Data security is also vital. The enterprise edition includes encryption for data transmission and storage, as well as access control features such as IP whitelisting and operation auditing. TDengine also offers view permissions, allowing fine-grained control over data access at the table, column, and time interval levels. Data subscriptions allow access to specific tables, columns, and time periods defined through SQL, with the ability to process or aggregate raw data, all while incorporating role-based access control. This is all designed to ensure the highest level of data access security.

With exponential growth in data volume, storage costs are a major concern. Therefore, TDengine Enterprise provides tiered storage, categorizing data by age, from memory for new data to S3 for old data, minimizing storage costs.
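The idea of placing data by age can be sketched in a few lines. This is only a conceptual illustration, not how TDengine Enterprise implements or configures tiering: the thresholds, the intermediate `local_disk` tier, and the function name are assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical age thresholds, ordered from hottest to coldest tier.
# The "local_disk" tier is an assumed intermediate step for illustration.
TIERS = [
    (timedelta(hours=1), "memory"),
    (timedelta(days=30), "local_disk"),
]
COLDEST_TIER = "s3"


def tier_for_age(age: timedelta) -> str:
    """Return the storage tier for data of the given age."""
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    # Anything older than every threshold goes to the cheapest storage.
    return COLDEST_TIER


if __name__ == "__main__":
    for age in (timedelta(minutes=5), timedelta(days=2), timedelta(days=365)):
        print(age, "->", tier_for_age(age))
```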

Since September 2022, TDengine has also offered TDengine Cloud, a fully managed cloud service on major platforms like AWS, Azure, and GCP. TDengine Cloud provides a fast, high-quality, and cost-effective solution for small and medium-sized enterprises, with scalable resources and professional services.

Moreover, we firmly believe that the future of open-source software lies in cloud services. By open-sourcing, we can quickly build a market brand and establish a developer community, leading a significant number of users to transition directly to our cloud services.

TDengine’s commercial success validates the open-source approach. Offering a free core product and additional paid features, as well as providing cloud services, allows TDengine to achieve a win-win scenario for both developers and businesses.

Maximizing Data Value

TDengine is fundamentally a time-series database designed to help users efficiently store and process time-series data collected from various sources. It cleanses and prepares the data, offering powerful SQL-based querying, analysis, and real-time data distribution services. Regardless of the scenario, the ultimate goal of data collection and storage is to extract valuable insights—whether it’s for real-time operational monitoring, immediate anomaly detection and alerts, forecasting future trends, or predictive maintenance of equipment. TDengine is dedicated to helping users maximize the value and utilization of their data.

TDengine’s built-in query and computation engine offers extensive data analysis capabilities, supporting standard SQL, nested queries, user-defined functions, and a wide range of time-series-specific functions. However, to ensure users can fully unlock the value of their data, TDengine seamlessly integrates with a variety of BI, AI, and visualization tools—such as Power BI, Tableau, and Grafana—via standard JDBC and ODBC interfaces. This allows users to choose their preferred tools for analyzing and processing the data stored in TDengine.

As real-time data analysis becomes increasingly crucial, TDengine offers advanced stream processing capabilities with support for various windowing mechanisms, including sliding, state, event, session, and count windows. Additionally, TDengine provides flexible and secure real-time data subscription capabilities to further enhance real-time computation. Whenever there is an update to the subscribed data, third-party applications receive instant notifications, enabling real-time processing and maximizing data value.
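To make two of these window types concrete, here is a minimal sketch of session and count windows over a list of event timestamps. It illustrates the general concepts only and is not TDengine's implementation; the function names and the integer timestamps are assumptions for the example.

```python
from typing import List


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions.

    A new session starts whenever the time since the previous event
    is larger than `gap`.
    """
    windows: List[List[int]] = []
    for ts in sorted(timestamps):
        if windows and ts - windows[-1][-1] <= gap:
            windows[-1].append(ts)
        else:
            windows.append([ts])
    return windows


def count_windows(values: List[int], size: int) -> List[List[int]]:
    """Split values into consecutive, non-overlapping windows of `size` items.

    The last window may be shorter if the input does not divide evenly.
    """
    return [values[i:i + size] for i in range(0, len(values), size)]


if __name__ == "__main__":
    events = [1, 2, 3, 10, 11, 30]
    print(session_windows(events, gap=5))  # [[1, 2, 3], [10, 11], [30]]
    print(count_windows(events, size=4))   # [[1, 2, 3, 10], [11, 30]]
```

In a stream processing setting, an aggregate such as an average or maximum would then be computed per window rather than over the whole series.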

To facilitate application development, TDengine offers connectors for a wide range of popular programming languages, including C/C++, Java, Python, Rust, Go, and Node.js, along with ready-to-use example code for various functionalities.

As we move further into the AI era, new algorithms, models, and data analysis tools are constantly emerging, and no single vendor can cover all these new tools comprehensively. TDengine addresses this challenge by offering open interfaces that ensure seamless integration with these new tools and platforms, helping users fully realize the potential of their data.

TDengine's ecosystem

Reflecting on the Journey

Seven years have passed since I wrote the first line of code for TDengine. The 49-year-old programmer back then is now 56. Instead of enjoying a leisurely life, I chose to embark on my third entrepreneurial journey, diving back into coding with renewed vigor. This journey has allowed me to channel the energy of my younger years into a new endeavor and continue pursuing the dreams of my youth. I’m heartened to see TDengine’s daily installation rate continue to grow and to witness our product being embraced and loved by an increasing number of users. Our commercialization efforts have been largely successful: we have over 200 paying customers around the world, in industries such as utilities, renewable energy, automotive, oil & gas, petrochemicals, mining, and smart manufacturing.

Since AlphaGo’s milestone achievement in 2016, the advances in artificial intelligence, exemplified by ChatGPT in 2023, have underscored the growing importance of data infrastructure. As data volumes continue to surge exponentially, the value of data infrastructure becomes increasingly apparent. With most of this data coming from machines, devices, and sensors, the significance of the time-series data processing market is set to grow. As traditional databases and big data platforms struggle to meet performance, scalability, and cost requirements, TDengine is poised to shine.

I am grateful for the decision I made in 2016 to take on this challenging yet rewarding endeavor. TDengine is a product with strong demand, high technical barriers, long-term commitment, and immense growth potential. I am also thankful that we chose to open-source our core code, enabling us to achieve over 560,000 installations across more than 60 countries and regions worldwide in just five years. This journey is a marathon, and while we have completed five years, my focus remains on leading the team forward, aiming for TDengine to become the de facto standard for time-series big data platforms.

Doing something hard but right is a choice I will never regret.

  • Jeff Tao

    With over three decades of hands-on experience in software development, Jeff has had the privilege of spearheading numerous ventures and initiatives in the tech realm. His passion for open source, technology, and innovation has been the driving force behind his journey.

    As one of the core developers of TDengine, he is deeply committed to pushing the boundaries of time series data platforms. His mission is crystal clear: to architect a high performance, scalable solution in this space and make it accessible, valuable and affordable for everyone, from individual developers and startups to industry giants.