Performance: TDengine vs Cassandra

Avatar
TDengine Team
/

Abstract:In this test, TDengine is compared with Cassandra in the terms of writing throughput, query throughput, aggregation query response time and on-disk compression. The results demonstrates that TDengine outperforms Cassandra with 20x greater write throughput, 17x larger data ingestion, 4000x faster in aggregation query (2500x when grouping by tags and 119x when grouping by time) while using 26.7x less disk space.

About the hardware

Servers and the testing program are running on the same Dell desktop of the model type ” OptiPlex- 3050″, with  4 cores and 8G memory. Detailed configurations are as follows:

  • OS: Ubuntu 16.04 x64
  • CPU: Intel(R) Core(TM) i3-7100 CPU @ 3.90GHz
  • Memory: 8GB
  • Disk: 1TB HDD

About the dataset

1.data description

Two popular datasets were investigated before this test:

  • New York taxi running information:  cannot design the data model becuase there is no information for the single car
  • faker generation tool: unfit for the IoT scenario because only strings can be generated.

To make this test repeatable, we write a specific data generation program, which simulates the temperature (int) and humidity (float) meausrements from thermohydrometers. Each thermohydrometer has three tags: device ID, device group, and device name. To make the simulation look authentic, the dataset distributes normally, not randomly.

All measurements are sampled every 1 second, containing 10k devices with 10k records from each device. There are 3 tags, 2 numeric values and 1 timestamp in each record.

2.data generatation code

The data generation code is written in Java, which can be downloaded here and executed as follows:

cd tests/comparisonTest/dataGeneratorjavac com/taosdata/generator/DataGenerator.java 

3.options for the data generation code

  • dataDir : filepath to store output data files
  • numOfFiles : number of output data files
  • numOfDevices : number of devices
  • rowsPerDevice : number of records from each device

4.data generation

Execute the following commands and then 100 data files would be created. Each data file includes measurements from 100 devices. Totally there are 10k devices with 10k records from each device.

mkdir ~/testdatajava com/taosdata/generator/DataGenerator -dataDir ~/testdata -numOfDevices 10000 -numOfFiles 100 -rowsPerDevice 10000

TDengine preparation

TDengine is an  open-source big data platform designed and optimized for Internet of Things (IoT), Connected Vehicles, and Industrial IoT. Besides the 10x faster time-series database (TSDB), it provides caching, stream computing, message queuing and other functionalities to reduce the complexity and costs of development and operation.

1.installation

  • Download tdengine-1.6.1.0.tar.gz.
  • Unzip and then run “install.sh” to install TDengine
  • Start TDengien by executing “sudo systemctl start taosd”
  • If the installation succeeds, enter “taos” in the terminal and then the following texts will be displayed:
Welcome to the TDengine shell, server version:1.6.1.0  client version:1.6.1.0Copyright (c) 2017 by TAOS Data, Inc. All rights reserved.
taos>

2.data model

For TDengine, a super table would be created for all devices in the same type and then one table for one device. Thus, for the super table, the data records include measurement time, temperature and humidity; the static device atributes in tags include device ID, device group, and device name.
SQL syntax for creating a super table

create table devices(ts timestamp, temperature int, humidity float) tags(devid int, devname binary(16), devgroup int);


SQL syntax for dynamically creating one table using the super table as template and insert one record

insert into dev1 using devices tags(1,'d1',0) values(1545038786000,1,3.560000);

3.testing code

The TDengine C driver is used to insert and query reocrds. In the future the testing code based on JDBCdriver will also be provided. Currently the testing code can be downloaded here. Enter the following commands in the terminal, an executable file “./tdengineTest” will be created:

cd tdengine
make

4.how to use the testing code

Writing Options

  • writeClients : number of client connections to insert data concurrently, default 1
  • rowsPerRequest : number of records in one requestnumber of records in one request, default 100
  • dataDir : data file path, same with the dataDir in the data generation code
  • numOfFiles : number of files read from dataDir

For example

./tdengineTest -dataDir ./data -numOfFiles 10 -writeClients 2 -rowsPerRequest 100

Query Options

  • sql: path of the files which store all SQL statements to executed.

For example

./tdengineTest -sql ./sqlCmd.txt

Cassandra preparation

Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database.

1.installation

echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
  • Start Cassandra
sudo service cassandra start
  • If the installation succeeds, enter “cqlsh” in the terminal and then the following texts will be displayed:
Connected to Test Cluster at 127.0.0.1:9042.[cqlsh 5.0.1 | Cassandra 3.11.4 | CQL spec 3.4.4 | Native protocol v4]Use HELP for help.
cqlsh>

2.data model

Build a table named “test” in the keyspace “cassandra”, to which all devices belong. Each record includes six columns: timestamp (ms), temperature (int) and humidity (float), device ID (int), device group (int) and device name (string). Primary keys are device group, device id, device name and timestamp. Because in the Cassadra “where” clause can only filter primary key, all columns to be filtered in this test need to be primary keys. Meanwhile, in the Cassandra, “group by” clause can only aggregate primary keys orderly, the column to be grouped must be put in the first place of primary keys.

3.testing code

The Cassandra Java driver is used to insert and query reocrds. The testing code can be downloaded here.
First install Cassandra Java client (https://github.com/datastax/java-driver) provide by DataStax. Related dependency has been added in the pom.xml to use this Java client. 

4. how to use the testing code

Preparation


To avoid “timeout” error due to the slow write/query response, the default “timeout” value for Cassandra server and client need to be modified before this test.

  • server, enlarge all options related to “timeout” by 100-1000 times in the “/etc/cassandra/cassandra.yaml”
  • client, In the “application.conf” file under the folder “cassandra/”, client timeout default value has been modified. This file path needs to be provided in the test.

Writing Options

  • writeClients : number of client connections to insert data concurrently, default 1
  • rowsPerRequest: number of records in one request, default 100
  • dataDir : data file path, same with the dataDir in the data generation code
  • numOfFiles : number of files read from dataDir
  • conf: path storing the client configuration file

For example

cd cassandra/cassandratest/target

java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -datadir ./data -numofFiles 100 -rowsperrequest 2000 -writeclients 4 -conf cassandra/application.conf

Querying Options

  • sql: path of the files which store all SQL statements to be executed.

For example

cd cassandra/cassandratest/target

java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -sql cassandra/sqlCmd.txt -conf cassandra/application.conf

Write performance

One writing request can send one record or multiple records, denoting as “R/R” or “Records/Request”. The writing speed would increase with the “R/R”. Meanwhile, one database server can connect to many clients. The more connection, the larger write throughput. Thus, both one-connection case and multi-connection case would be tested.

1.TDengine

Writing tests are taken in multiple scanarios, including 1R/R, 100R/R, 500R/R, 1000 R/R and 2000R/R with different number of connections. You can change the options in the example and take different tests.

  1. clean up existing dataset
    drop database db;
  2. start testing
    The example command to read 100 data files in the ~/testdata and insert 1000 records per request by 5 clients:
    ./tdengineTest -dataDir ~/testdata -numOfFiles 100 -writeClients 5 -rowsPerRequest 1000

Write throughput is as follows,unit in records/second

R/R1 client2 clients3 clients4 clients5 clients6 clients7 clients
126824436995513762869645296864772277
100415800734484895522976085108790211710741192199
50047984688261210830321195100126919613642561417004
100050075191449411219141239157136798914181041476560
2000512820105552011741641306904142663514584341477208
TDengine time series database | 22.06 01

2.Cassandra

Writing tests are taken in multiple scenarios, including 1R/R, 10R/R, 50R/R, 100R/R, 500R/R and 1000R/R with different number of connections. You can change the options in the following example and take different tests.

  1. clean up existing dataset
    drop database Cassandra;
  2. start testing
    The example command to read 100 data files in the ~/testdata and insert 1000 records per request by 5 clients:
    java -jar Cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -dataDir ~/testdata -numOfFiles 100 -writeClients 5 -rowsPerRequest 1000 -conf Cassandra/application.conf

Write throughput is as follows,unit in records/second

R/R1 client2 clients3 clients4 clients5 clients6 clients7 clients
13515492555295991633163806597
1035998355423512434135350773588636102
5031743494235162655752572825681555831
10038328503875451956940578535933561708
50030417362643807839066394593975839918
100021555252932622426559267652651126693
TDengine time series database | 22.06 02

3.Best Write Performance: TDengine vs Cassandra

Compare the best writting performance of TDengine and Cassandra. Results are as follows:

R/R1 client2 clients3 clients4 clients5 clients6 clients7 clients
TDengine512820105552011741641306904142663514584341477208
Cassandra38328503875451956940578355933561708
TDengine time series database | 22.06 03

Figure 3 demonstrates that the writing speed of TDengine is in the order of 1million records per second while that of Cassandra is in the order of 10~100k records per second. In conclusion, it writes about 20 times faster in the TDengine than in the Cassandra.

Read Performance

For reading performance, this test takes a simple tranversing query, that is, reading all the data having been written into the database. 

  1. how to use the TDengine testing code
    SQL expresssions are in the “tdengine/q1.txt”, for example, select * from db.devices where devgroup=0;
    Execute the following command:
    ./tdengineTest -sql ./q1.txt
  2. how to use the Cassandra testing code
    SQL expresssions are in the “cassandra/q1.txt”, for example, select * from devices where devgroup=0;
    Execute the following command:
    java -jar Cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -conf Cassandra/application.conf -sql Cassandra/q1.txt

Reading speed is as follows,unit in second

LatencyG-0G-10G-20G-30G-40G-50G-60G-70G-80G-90
TDengine0.2350.2120.2180.2090.2100.2090.2090.2090.2160.208
Cassandra3.923.683.653.613.693.573.553.593.663.64
TDengine time series database | 22.06 04

Figure 4 demonstrates that the stable reading speed of TDengine is about 0.21s, that is, reading 5million records per second. Meanwhile, the stable reading speed of Cassandra is about 3.6s, that is, reading 300k records per second. In conclusion, the query throughput size in TDengine is 17 times larger than that in Cassandra.

Aggregation Performance

This part tests five aggregation function: COUNT, AVERAGE, SUM, MAX and MIN which are shared by TDengine and Cassandra. Each aggratation query would be paired with a filter to select 1/10, 2/10, 3/10, …, or all of the 100 devices.

1.TDengine

SQL expresssions are in the “tdengine/q2.txt”, for example,

select count(*) from db.devices where devgroup<10;


Execute the following command

./tdengineTest -sql ./q2.txt

Query response time is as follows,unit in second

10%20%30%40%50%60%70%80%90%100%
COUNT0.0180.0260.0160.0180.0170.0240.0240.0270.0300.033
AVG0.0070.0140.0150.0200.0240.0380.0440.0500.0570.060
SUM0.0060.0100.0190.0180.0310.0360.0340.0370.0430.046
MAX0.0070.0130.0150.0200.0250.0300.0350.0390.0450.049
MIN0.0060.0100.0160.0240.0320.0390.0450.0410.0430.049
SPREAD0.0070.0100.0150.0190.0330.0380.0460.0520.0590.066
TDengine time series database | 22.06 05

2.Cassandra

SQL expresssions are in the “cassandra/q2.txt”.

select count(*) from devices where devgroup<10;


Execute the following command:

Java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -sql cassandra/q2.txt -conf cassandra/application.conf


Query response time is as follows,unit in second,

10%20%30%40%50%60%70%80%90%100%
COUNT33.7967.2387.64105.82131.52160.88188.70213.85240.39264.49
MEAN28.8857.8387.16114.87145.30173.32204.11235.33261.29290.97
SUM29.3558.1986.24115.56145.73173.81203.94234.15260.41292.51
MAX28.9457.8585.60115.02145.62175.08202.53232.61260.37288.46
MIN29.5858.2687.27117.22144.01174.20201.88235.98263.69290.27
TDengine time series database | 22.06 06

3.Comparison

Compare the query response time between TDengine and Cassandra based on the 1E8 records.

CountAverageSumMaxMin
TDengine0.0330.060.0460.0490.049
Cassandra264.49290.97291.51288.46290.27
TDengine time series database | 22.06 07

Figure 7 demonstrates that the response time of aggregation query in TDengine is within 100ms while the response time in Cassandra is about 200~300 seconds. In conclusion, the response time of aggregation query in the TDengine is more than 100 times shorter than that in the Cassandra.

Performance of Aggregation  grouped by tags

This part tests the aggregation performance grouped by tags. Each aggratation query would be paired with a filter to select 1/10, 2/10, 3/10, …, or all of the 100 devices.

  1. how to use the TDengine testing code
    SQL expresssions are in the “tdengine/q3.txt”, for example,
    select count(temperature), sum(temperature), avg(temperature) from db.devices where devgroup<10 group by devgroup;
    Execute the following command:
    ./tdengineTest -sql ./q3.txt
  2. how to use the Cassandra testing code
    SQL expresssions are in the “cassandra/q3.txt”, for example,
    select count(temperature), sum(temperature), avg(temperature) from db.devices where devgroup<10 group by devgroup;
    Execute the following command:
    java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -sql cassandra/q3.txt -conf cassandra/application.conf

Query response time is as follows,unit in second,

10%20%30%40%50%60%70%80%90%100%
TDengine0.0300.0280.0310.0410.0690.0660.0770.0910.1020.123
Cassandra31.4062.2192.12122.01154.95185.03217.46249.59281.86308.89
TDengine time series database | 22.06 08

Testing results show that the response time of aggretaion query grouped by tags is 3000 times shorter in the TDengine than that in the Cassandra.

Performance of Aggregation grouped by time

This part tests the aggregation performane grouped by time. Each aggratation query would be paired with a filter to select 1/10, 2/10, 3/10, …, or all of the 100 devices.

  1. how to use the TDengine testing code
    SQL expresssions are in the “tdengine/q4.txt”, for example,
    select count(temperature), sum(temperature), avg(temperature) from db.devices where devgroup<10 interval(1m);
    Execute the following command:
    ./tdengineTest -sql ./q4.txt
  2. how to use the Cassandra testing code
    Because the limit of “where” and “group by” clause, data needs to be inserted into the database again, in which a new column “minute” is added and put to the first place of primary keys.
    Execute the following command:
    java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -datadir ~/testdata -numofFiles 100 -rowsperrequest 2000 -writeclients 4 -conf cassandra/application.conf -timetest
  3. SQL expresssions are in the “cassandra/q4.txt”, for example,
    select count(temperature), sum(temperature), mean(temperature) from devices where devgroup<10 group by minute;
    Execute the following command:
    java -jar cassandratest-1.0-SNAPSHOT-jar-with-dependencies.jar -sql cassandra/q4.txt -conf cassandra/application.conf

Query response time is as follows,unit in second,

10%20%30%40%50%60%70%80%90%100%
TDengine0.2370.4720.6530.9021.1341.4221.7531.7842.0852.549
Cassandra131.35153.87169.40188.86203.47227.61250.41274.53294.87303.51
TDengine time series database | 22.06 09

Testing results show that the response time of aggretaion query grouped by time is 100 times shorter in the TDengine than that in the Cassandra.

On-disk Compression

1.Original dataset size

In this test 100 data files are generated and stored in the folder “~/testdata”, whose size can checked by command “du”

cd ~/testdata
du -h .
TDengine time series database | 22.06 10

2.Disk space using by TDengine

In TDengine all data are saved in the directory “/var/lib/taos/data” by default. Stop TDengine server before checking the data size.

sudo systemctl stop taosd

Then check the size of data in the folder “/var/lib/taos/data/” by command “du”.

cd /var/lib/taos/datadu -h .

Results are shown in Fig. 11.

TDengine time series database | 22.06 11

3.Disk space using by Cassandra

In Cassandra all data are saved in the directory “/var/lib/Cassandra/data/keyspace_name” by default. Stop Cassandra server before checking the data size.

sudo service Cassandra stop

Check the data folder size by command “du”. 

cd /var/lib/Cassandra/data/Cassandra
du -sh .

Results are shown in Fig. 12.

TDengine time series database | 22.06 12

4.Comparison of disk usage

Original test dataset occupies 3941MB in the disk, data in the Cassandra 12GB and data in the TDengine 459MB. The on-disk compression ratio in the TDengine is 26.7 times of that in the Cassandra.

In the real scenarios of IoT, the on-disk compression ratio of the TDengine is expected to be larger because of the limited spread of real measurements and the column-based storage of TDengine. 

Feature Comparison

Both TDengine and Cassandra can be used to process time-series data and they have some similar features. 

FeatureTDengineCassandra
SQL syntax
Private deployment
Scalability
System connection management
Query task management
Data import
Data export
Web management
Multi-layer storage
Telegraf data collection
Grafana data visualization
RESTful
C/C++
JDBC/ODBC
Go
Python
Database configuration
Replica configuration
Data alive time
Data partition
Streaming
Subscriber
Microsecond precision
Aggregation
Downsampling
Limit/offset
Interpolation
Data updated
Tag updated
Time delete
Data cleanup

Conclusion

In this test, TDengine is compared with Cassandra in the terms of writing throughput, query throughput, aggregation query response time and on-disk compression. Test dataset, codes and SQL expressions can be downloaded here. Thus, anyone can repeat this test again.


This test demonstates that TDengine outperforms Cassandra with 20x greater write throughput, 17x larger query throughput, 4000x faster in aggrgation query (2500x when grouping by tags and 119x when grouping by time) while using 26.7x less disk space.

TDengineCassandra
Write throughput1,477,208 rows/second61,708 rows/second
Time to ingest 1 million rows0.21s3.64s
Response time to average 100 million rows0.06s264.49s
Response time to average 100 million rows by tag0.123s308.39s
Response time to average 100 million rows by timestamp2.549s303.51s
Disk usage for 100 million rows459 MB12 GB