Over the last few days I experimented a bit with compression in practice. Here are some interesting thoughts.
Large files stored with compression are usually not compressed efficiently. This leaves large room for improvement in both cost and computation. The most common compression algorithms and practices today are mainly:
- gzip – the most popular, due to a decent compression ratio, a small memory footprint, and the lowest computational cost of the three.
- bzip2 – a better compression ratio than gzip, but it uses more memory at most compression levels and is slower in general.
- 7zip (LZMA algorithm) – not commonly distributed on Unix-like OSes, and its memory usage and runtime grow steeply with the compression level, but it gives the best compression ratio of the three.
Large, established compression algorithms and implementations treat compression as a general problem: all raw data is handled uniformly under the same compression rules.
But compression, at its core, is about CATEGORIZING AND MINIFYING.
So here is our angle of approach: building a semantic understanding of a specific data set lets the compression algorithm tackle each data set differently, and thus achieve a better compression ratio.
To be more specific, grouping a data set by column (vertically) yields a more uniform format, to which encoding algorithms such as dictionary encoding can be applied more easily.
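As a minimal sketch of that columnization step (the record schema and field names here are hypothetical), row-oriented records can be transposed into per-column lists so each column can be encoded separately:

```python
# Transpose row-oriented records into per-column lists so that each
# column can later be encoded with the scheme that fits it best.
# The field names ("ts", "level", "host") are made up for illustration.
rows = [
    {"ts": 1700000000, "level": "INFO", "host": "web-1"},
    {"ts": 1700000003, "level": "INFO", "host": "web-1"},
    {"ts": 1700000007, "level": "WARN", "host": "web-2"},
]

columns = {key: [row[key] for row in rows] for key in rows[0]}
# columns["ts"]    -> [1700000000, 1700000003, 1700000007]
# columns["level"] -> ["INFO", "INFO", "WARN"]
```

Each resulting list holds values of a single type with similar structure, which is exactly the uniformity the encodings below exploit.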
The algorithms mentioned here are the ones that have been used; the list may be extended or updated in the future. The list is as follows:
- Delta encoding compresses data by recording the difference between consecutive values in a column. These differences are recorded in a separate dictionary for each block of column values on disk. Timestamps are a good example, since they are typically sequential: instead of storing the full value each time, we store only the delta, which saves a lot of space.
- Run-length encoding replaces a value that is repeated consecutively with a token consisting of the value and a count of its consecutive occurrences. This encoding is best suited to tables in which data values are often repeated consecutively, for example when the table is sorted by those values.
- Dictionary encoding builds a separate dictionary of unique values for each block of column values. Instead of storing the actual value, a smaller dictionary token replaces it. This encoding is very effective when a column contains a limited number of unique values.
- Huffman coding is a particular type of optimal prefix code commonly used for lossless data compression. At its core, Huffman coding uses a specific method for choosing the representation of each symbol, resulting in a prefix code (sometimes called a "prefix-free code"): the bit string representing one symbol is never a prefix of the bit string representing any other symbol.
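To make the first three encodings concrete, here is a minimal sketch of each (these are illustrative toy implementations, not the exact routines used in the experiments):

```python
def delta_encode(values):
    """Store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

def dict_encode(values):
    """Map each distinct value to a small integer token."""
    table = {}
    tokens = [table.setdefault(v, len(table)) for v in values]
    # Recover the lookup table in token order for decoding.
    ordered = [v for v, _ in sorted(table.items(), key=lambda kv: kv[1])]
    return tokens, ordered

def dict_decode(tokens, table):
    return [table[t] for t in tokens]

# Sequential timestamps shrink to small deltas:
timestamps = [1700000000, 1700000003, 1700000007]
# delta_encode(timestamps) -> [1700000000, 3, 4]

# A low-cardinality column shrinks to small tokens:
levels = ["INFO", "INFO", "INFO", "WARN"]
# dict_encode(levels) -> ([0, 0, 0, 1], ["INFO", "WARN"])
```

Each encoding is lossless: the decode function restores the original column exactly, which is what makes restructuring the raw data from the encoded form feasible.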
Based on the theory above, the implementation is mainly composed of three parts:
- First, analyze the data sets semantically.
- Second, categorize/columnize similar data attributes using a fixed pattern, which makes restructuring the original data from the encoded form feasible.
- Last, compress each extracted data attribute using well-established compression tools. Given more time, further customized compression programs could be written, but with limited time and resources, most hand-rolled compressors/decompressors would at best only match gzip, based on the experiments here.
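The three steps above can be sketched end to end as follows. This is a simplified illustration, not the actual pipeline: the schema is hypothetical, the "semantic analysis" is done by hand (we simply assume "ts" is a monotonically increasing integer column), and JSON is used as a stand-in serialization format.

```python
import gzip
import json

def delta_encode(values):
    # Step 3's column-specific transform for sequential integers.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def compress(rows):
    # Step 1: semantic analysis (manual here): "ts" is known to be
    # a monotonically increasing integer column.
    # Step 2: columnize the records.
    columns = {k: [r[k] for r in rows] for k in rows[0]}
    # Step 3a: apply the transform that fits each column's semantics.
    columns["ts"] = delta_encode(columns["ts"])
    # Step 3b: hand the restructured data to a well-established
    # general-purpose compressor for the final pass.
    return gzip.compress(json.dumps(columns).encode())

rows = [{"ts": 1700000000 + 3 * i, "level": "INFO"} for i in range(1000)]
blob = compress(rows)
```

The point of the sketch is the division of labor: the semantic transforms turn each column into a highly regular byte stream (here, a long run of small deltas and a single repeated string), and the general-purpose compressor then squeezes that regularity far better than it could the interleaved raw rows.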
Finally, the approach above achieved the results below.
Final Compression Ratio Benchmark

| Sample Size | Raw File (B) | Gz File (B) | Our File (B) | Improvement Over Gzip |
|-------------|--------------|-------------|--------------|-----------------------|
| 100MB | 114462596 | 11894010 (10.391%) | 8613602 (7.525%) | 38.08% |
| 500MB | 557523253 | 57931368 (10.390%) | 31205248 (5.597%) | 85.64% |
| 1GB | 1098079669 | 115986078 (10.562%) | 62586282 (5.699%) | 85.32% |

Percentages in parentheses are compression rates (compressed size / raw size).

Final Computational Time Benchmark

| Sample Size | Gzip Time Spent | Customized Time Spent |
|-------------|-----------------|-----------------------|
| 100MB | 0m2.847s | 0m48.404s |
| 500MB | 0m14.030s | 4m14.720s |
| 1GB | 0m27.649s | 10m33.379s |