High Availability at Low Cost: Dual-Replica Mode and Active-Active Deployment

Chait Diwadkar

November 26, 2024 / Engineering

While availability and reliability are critical for data infrastructure, they do come at a cost. A typical TDengine HA cluster consists of three nodes using the Raft consensus algorithm, with three replicas of the stored time-series data. This provides high availability and ensures data reliability without the need for an arbitrator, but some use cases demand a less resource-intensive solution. TDengine Enterprise includes two solutions for customers looking to achieve high availability at a lower cost: dual-replica mode and active-active deployment.

Dual-Replica Mode

TDengine offers an arbitrator-based dual-replica solution that enables fault tolerance, provided that only one node fails at a time and failures are not continuous. Compared with three-replica deployments, dual-replica mode reduces hardware costs while still providing a degree of high availability. In a dual-replica database, each vgroup contains only two vnodes. If one vnode fails, the mnode verifies the status of data synchronization and determines whether the other vnode can independently provide services. This mechanism ensures no data is lost and allows ingestion and query operations to continue during a single-node failure.

The dual-replica solution suits customers with the following requirements:

Lower storage costs
Reduced number of physical nodes
Slightly relaxed high-availability requirements

From a technical perspective, the dual-replica solution has the following features:

Replica count: The cluster must include three or more nodes, but there are only two replicas of the time-series data.
Automated failover: If the leader node fails, the system switches automatically to a secondary node, ensuring data integrity and continuous ingestion and queries.
Leader election: Leadership arbitration is handled by the mnode, and not determined within the Raft group.

Dual-replica mode includes an arbitrator that assigns a new leader vnode if the existing leader fails. The new leader is chosen based on the synchronization status of the data in the cluster. This ensures system reliability and availability, even when one replica node fails.
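
The arbitration rule described above can be illustrated with a small toy model. This is only a sketch and not TDengine's implementation: the Vnode class, the in_sync flag, and the arbitrate function are hypothetical names introduced here. It assumes the arbitrator knows whether each replica was fully synchronized at the last check, and lets the surviving replica serve alone only if it was.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vnode:
    """Hypothetical view of one replica as seen by the arbitrator."""
    name: str
    alive: bool
    in_sync: bool  # assumed flag: replica held all committed data at last check


def arbitrate(leader: Vnode, follower: Vnode) -> Optional[Vnode]:
    """Return the vnode that should serve the vgroup, or None if none is safe."""
    if leader.alive:
        return leader  # no failover needed
    if follower.alive and follower.in_sync:
        return follower  # survivor has all data, so it can serve alone
    return None  # promoting a stale or dead replica could lose data


a = Vnode("vnode-a", alive=False, in_sync=True)
b = Vnode("vnode-b", alive=True, in_sync=True)
chosen = arbitrate(a, b)
print(chosen.name if chosen else "vgroup unavailable")
```

In this model, a follower that was out of sync when the leader failed is never promoted, which mirrors the requirement that failures must not be continuous: a second failure before the replicas resynchronize leaves no safe candidate.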

Active-Active Deployment

Active-active deployment is ideal for resource-constrained environments where only two servers can be deployed. It can also be used to implement disaster recovery scenarios requiring two TDengine clusters in different locations. Although two nodes are the minimum, there is no upper limit on the number of nodes in an active-active deployment.

The technical mechanisms that enable TDengine to provide high availability and reliability in active-active mode are described as follows:

Client Driver Failover: TDengine automatically switches to the secondary node if the primary node fails, ensuring uninterrupted service.
Data Replication: TDengine continuously replicates data between the primary and secondary nodes, with special markers in the write-ahead log (WAL) to identify replicated data.
Data Filtering: The read interface for data subscription filters out data with special markers to avoid duplication or infinite loops.
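
The replication and filtering mechanisms above can be sketched with a toy model. The code below is not TDengine code: the WalEntry and Cluster classes, the replicated flag (standing in for the special WAL marker), and the replicate function are all hypothetical. It only shows why skipping marked entries on the subscription read path prevents data from bouncing between the two sides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WalEntry:
    payload: str
    replicated: bool = False  # stands in for the special WAL marker


@dataclass
class Cluster:
    name: str
    wal: List[WalEntry] = field(default_factory=list)

    def write(self, payload: str, replicated: bool = False) -> None:
        self.wal.append(WalEntry(payload, replicated))

    def subscribe(self, offset: int) -> List[WalEntry]:
        """Read entries from offset, skipping those that arrived via replication."""
        return [e for e in self.wal[offset:] if not e.replicated]


def replicate(source: Cluster, target: Cluster, offset: int) -> int:
    """Copy locally written entries to the peer, marking them as replicated."""
    for entry in source.subscribe(offset):
        target.write(entry.payload, replicated=True)
    return len(source.wal)  # next offset to read from


primary, secondary = Cluster("primary"), Cluster("secondary")
primary.write("t1: 21.5")
secondary.write("t2: 19.0")

# One round of replication in each direction.
p_off = replicate(primary, secondary, 0)
s_off = replicate(secondary, primary, 0)
# A second round finds nothing new, so data does not bounce back and forth.
replicate(primary, secondary, p_off)
replicate(secondary, primary, s_off)

print([e.payload for e in primary.wal])
print([e.payload for e in secondary.wal])
```

After both rounds, each side holds exactly one copy of each record. Without the marker check in subscribe, every replicated entry would be picked up again and sent back to its origin.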

Note that active-active deployments support only the JDBC connector in WebSocket connection mode. Both nodes in the active-active setup must be identical, with the same database names and configuration parameters, to ensure proper operation.

Key Differences

While both solutions aim to improve data reliability and service availability, they differ significantly in architecture and application. Understanding these differences is critical for choosing the right solution. The main differences are as follows:

Cluster Architecture: Active-active deployment requires two independent clusters with flexible node counts, while dual-replica mode requires a single cluster with at least three nodes.
Synchronization: Active-active deployment uses data subscription to enable data synchronization, typically resulting in a synchronization delay of several seconds, while dual-replica mode relies on the arbitrator, which does not incur a delay.
Data Reliability: Active-active deployment cannot prevent data loss if the secondary node fails for a time exceeding the retention time of the WAL, whereas dual-replica mode can ensure no data loss.
High Availability: Active-active deployments can operate as long as one node is online, while dual-replica mode may fail to provide services if only one node remains after consecutive failures.
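
The data reliability point for active-active deployment comes down to a simple comparison between the length of the outage and the WAL retention time. The sketch below is illustrative only: the function name and the one-day retention value are assumptions, not TDengine defaults.

```python
def secondary_can_catch_up(outage_seconds: float, wal_retention_seconds: float) -> bool:
    """True if WAL entries written during the outage are still retained."""
    return outage_seconds <= wal_retention_seconds


wal_retention = 3600 * 24  # assumed retention of one day, for illustration only
for outage in (600, 3600 * 30):
    if secondary_can_catch_up(outage, wal_retention):
        status = "catch up from WAL"
    else:
        status = "data gap possible"
    print(f"outage of {outage} s: {status}")
```

In practice this suggests sizing the WAL retention time to cover the longest outage you expect the secondary side to experience.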

Best Practices

Dual-Replica:

Deploy at least three nodes, with one node designated for arbitration and not storing data replicas to reduce resource usage.
Upgrade from a single-replica cluster by increasing the node count to three or more and modifying the replica settings.

Active-Active:

Complete all database and table creation operations immediately after system startup to avoid metadata inconsistencies during failovers.

Conclusion

Both active-active deployment and dual-replica mode give users high availability at a lower cost than a standard three-replica cluster. The dual-replica solution is well suited to cost-sensitive scenarios, while the active-active solution offers flexibility and strong disaster recovery capabilities. Either option can help you build reliable and efficient database systems with TDengine while improving business continuity and data security.

Chait Diwadkar previously worked as Director of Solutions Engineering at TDengine.