One of the characteristics of time series data is that it is particularly suitable for compression.
The purpose of compression is to represent the given data in as few bits as possible. In essence, this process trades compute time for storage space – while data compression enables more data to be stored on a storage medium, more compute resources are required to read and write compressed data.
There is no single compression method or algorithm that is optimal for all data; instead, an appropriate method is chosen based on the characteristics of the data being compressed.
Data compression is a summary of the trends and rules found in data. These rules can be based on content – the similarities between adjacent frames in a video, for example; on representation, as in entropy coding and transform coding; or on bitrate, as in differential compression and deep compression.
If the original data and decompressed data are exactly the same, the compression method can be considered lossless. A compression method that alters data is considered lossy.
- Losslessly compressed data can be completely reconstructed after decompression into the original raw data. This compression method is used in scenarios where data accuracy is important, such as compressing entire hard drives or compressing executable files. It may also be used for multimedia compression. However, lossless methods have relatively low compression ratios. Common lossless compression methods include differential coding, run-length encoding (RLE), Huffman coding, LZW compression, and arithmetic coding.
- On the other hand, with lossy compression, the compressed data cannot be reconstructed into the original data – only into an approximation of that data. This compression method is used in scenarios where data accuracy is not important, such as compressing multimedia files. However, lossy compression enables high compression ratios. Common lossy compression methods include predictive codecs, fractal compression, wavelet compression, JPEG, and MPEG.
Importance of Data Compression
Modern datasets grow quickly in size, requiring increasingly large storage systems to contain them. The importance of data compression is in saving storage resources and reducing the associated costs. Time series compression is particularly important due to the massive scale of datasets combined with the highly compressible nature of the data, making it a smart investment.
Time Series Compression
Commonly used compression methods for time series data are described as follows:
- Huffman coding:
This is one of the most widespread compression schemes, first described in the 1850s by David Huffman in his paper “A Method for Constructing Minimal Redundancy Codes”. Huffman coding works by finding the optimal prefix code for a given alphabet. It should be noted that a prefix code here represents a value, and it is necessary to ensure that the prefix code of each symbol in the alphabet does not become the prefix of another symbol prefix code. For example, if 0 is the prefix code for our first symbol, A, then no other symbol in the alphabet can start with 0. This mechanism is also useful because the prefix code makes decoding of the bitstream unambiguous.
- Run-length encoding (RLE)
A lossless data compression technique independent of the nature of the data, based on “replacing the original data that recurs continuously with a code of varying length” to achieve compression. For example, a set of string “AAAABBBCCDEEEE” consists of 4 A , 3 B , 2 C , 1 D , and 4 E. After RLE, the string can be compressed into 4A3B2C1D4E. Its advantages are simplicity, high speed, and the ability to compress continuous and highly repetitive data into small units. The disadvantage is also obvious, the data compression effect with low repetition is not good.
This algorithm is a specific algorithm designed in combination with the data characteristics that follow the IEEE754 standard floating-point storage format. The first value is not compressed, and the following values are the result of XOR (exclusive OR) with the first value. If the result is the same, only the storage A 0 to store the XORed result if the result is different. The algorithm is greatly affected by data fluctuation, the more severe the fluctuation, the worse the compression effect.
Differential encoding is also called incremental encoding. The first data is unchanged during encoding, and other data are converted into the delta of the previous data. The principle is similar to XOR, both of which are to calculate the difference between two adjacent data. This algorithm is widely used, such as when you need to see the history of changes to a file (version control, Git, etc.). However, this algorithm is rarely used alone in time series databases, and is generally used with RLE, Simple8b or Zig-zag, and the compression effect is better.
Also known as second-order differential encoding, Delta encoding is used again on the basis of Delta encoding, which is more suitable for encoding monotonically increasing or decreasing sequence data. For example, 2,4,4,6,8 , 2,2,0,2,2 after delta encoding, and 2,0,-2,2,0 after delta encoding. Usually it is also used with RLE, Simple8b or Zig-zag.
Zig-zag is to solve the problem of low coding efficiency for negative numbers by the varint algorithm. Its principle is very simple. It moves the flag bit to the end and removes the extra 0 in the coding, so as to achieve the effect of compression. For relatively small values, the compression efficiency is very high, but for large data, the efficiency may not be improved but may also be reduced. Therefore, Zig-zag is often paired with Delta encoding, which works well to convert large numerical data into smaller ones
The Snappy compression algorithm draws on the idea of the LZ77 algorithm. Due to the high time complexity of the pattern matching process in the LZ77 algorithm, Google has made many optimizations for it and opened it up in 2011. The principle is to assume that we have a sequence S=[9,1,2,3,4,5,1,2,3,4], and find the subsequence S~2,5~=[1,2,3, 4] is the same as S~7,10~=[1,2,3,4], so the sequence is encoded as S=[9,1,2,3,4,5,6,(2,4 )], 2 is the start position, and 4 is the number of digits. Snappy has the advantages of fast speed and reasonable compression rate, and is widely used in many open source projects, such as Cassandra, Hadoop, MongoDB, RocksDB , Spark, InfluxDB , etc.
The LZ4 data compression algorithm belongs to the byte-oriented LZ77 compression scheme family, the compression ratio is not high, and it is characterized by extremely fast decoding speed. According to the test results of the official test benchmark lzbench , the decompression speed is as high as 4970MB/s under the default configuration.
Simple8b is a 64-bit algorithm that implements compressing multiple integers (between 0 and 1<<60-1) into a 64-bit storage structure. The first 4 bits represent the selector, and the last 60 bits are used to store data. The advantage is that it is simple and efficient, and the fixed-length encoding ensures the decompression efficiency, but the compression rate is low for large integers or large floating values. This algorithm is suitable for unsigned integers with a small range.
LZO is a block compression algorithm that also belongs to the LZ77 family of compression schemes. The goal of this algorithm is fast compression and decompression, not compression ratio. In contrast, LZ4 decompresses faster. Because the data types stored in the block may be varied, the overall compression effect is far less good than the compression algorithm for a certain data type.
DEFLATE is a classic lossless data compression algorithm that uses both the LZ77 algorithm and Huffman Coding. In fact, DEFLATE is just an algorithm for compressing data streams. This algorithm is the default algorithm for zip file compression, and it is encapsulated in algorithms such as gzip and zlib .
Zstandard ( Zstd ) is designed to provide a compression ratio similar to the DEFLATE algorithm, but faster, especially for decompression. Its compression level can be adjusted from minus 5 (fastest) to 22 (slowest compression, but the highest compression ratio). In text log compression scenarios, its compression performance is comparable to or even better than LZ4 and Snappy. Linux kernel, HTTP protocol, Hadoop, HBase, etc. have all added support for Zstd . It is foreseeable that Zstd will be a compression algorithm that will be widely concerned in the next few years.
- Bit packing
The Bit-packing compression algorithm removes unnecessary bits from the data we want to compress, based on the premise that not all integers need 32-bit or 64-bit storage. For example, a 32-bit integer data whose value range is between [0,100] can be represented by 7 bits.
TDengine time series database (TSDB) provides three compression options: no compression, one-stage compression, and two-stage compression.
The one-stage compression is correspondingly compressed according to the type of data. The compression algorithms include delta-delta encoding, simple 8B method, zig-zag encoding, LZ4 and other algorithms, which we have briefly introduced above. Two-stage compression is based on one-stage compression, and then compresses with a general compression algorithm to ensure a higher compression rate.
To sum up, we can see that there are still many choices of compression algorithms, and the specific program can be comprehensively selected according to the compression effect and cost.