Wednesday, 3 September 2014

Simple Compression Algorithms comparison

This is an overview of some very rough tests of compression for handling a few TB of price data in smaller files. They are rough in the number of runs, and the results depend on the default Windows command line utilities (some, for instance, aren't using multiple cores), the machine, the files and the OS. But it's an indication.
----
XZ achieved the best compression, at around 8% of the original size, matched by 7zip. Note that xz compression was much quicker from the 7zip utilities than from the xz utils.
Note also that this took roughly 600 times as long as Zhuff's quickest mode, so there is a clear trade-off between compression speed and size in going from 17% compressed size to 8% compressed size.
Zhuff looks great - it gets close to 7zip in compression and is vastly quicker, using cores etc. I did have one crash with it (not sure why), so it may be worth waiting a bit for more general releases (it's also only in beta at the moment).
Snappy and LZ4 look much quicker than gzip with some loss of compression, with LZ4 outperforming Snappy.

Short answer: I'll use LZ4 until Zhuff is definitively stable, and then switch to it for the reduction in size in this use case. For maximum compression, or for manually saving space, it's 7zip with native 7z or XZ.
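
To put numbers on that speed versus size trade-off, here is a small Python sketch that redoes the arithmetic from the 200MB results further down. The figures are copied from that table; the function and variable names are just for illustration.

```python
# Figures copied from the 200MB results table in this post:
# (compress time in seconds, compressed size in k)
zhuff_fast = (0.2, 35161)  # zhuff #0
best_small = {
    "7zip -8": (122.9, 16750),
    "7zip xz -9": (155.15208, 16570),
    "xz -9": (260.3, 16569),
}


def tradeoff(slow, fast):
    """Return (how many times slower, fraction by which the output is smaller)."""
    slow_time, slow_size = slow
    fast_time, fast_size = fast
    return slow_time / fast_time, 1 - slow_size / fast_size


for name, row in best_small.items():
    times_slower, saved = tradeoff(row, zhuff_fast)
    print(f"{name}: {times_slower:.0f}x slower, output {saved:.0%} smaller than zhuff #0")
```

Depending on which of the 8% rows you take, the slowdown comes out at somewhere between roughly 600 and 1300 times, for an output about half the size.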

-----------longer detail -------------
As part of storing some price tick data, one approach is just to store compressed delimited or fixed-format files, say one file per instrument per day.
Compression might improve performance (less data to move on and off disc) and increase storage density.
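
As a minimal sketch of that layout, assuming a delimited (CSV) file and gzip from the Python standard library, something like the following would write one compressed file per instrument per day and read it straight back into memory. The directory layout, function names and column contents are all hypothetical, and gzip is only used here because it ships with Python; any of the compressors below could take its place.

```python
import csv
import gzip
from pathlib import Path


def tick_path(root, instrument, day):
    """Hypothetical layout: <root>/<instrument>/<YYYY-MM-DD>.csv.gz"""
    return Path(root) / instrument / f"{day}.csv.gz"


def write_ticks(root, instrument, day, rows):
    """Write one day's ticks for one instrument as gzip-compressed CSV."""
    path = tick_path(root, instrument, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return path


def read_ticks(path):
    """Uncompress a day's file straight into memory as a list of rows."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Reading goes from the compressed file on disc directly into memory, so no uncompressed copy ever has to be written out.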
Some standard compression algorithms:
7zip (as a base comparison for compression and time)
gzip
XZ
LZ4
Google's Snappy
and, for interest, a comparison with the new compressor from LZ4's creator:
Zhuff
Initially I was just looking for a rough comparison and used a 200MB text file.
Then I checked a 1.2GB file and a 6GB file.
This was on a dual Xeon E5-2630 2.3 GHz machine with 32 GB RAM and hard discs, running Windows Server 2008 R2 Standard.
Note this includes compressing and decompressing to the hard disc - the actual use case would be to compress to disc and then uncompress to memory. Some of the libraries will be using the cores better than others.
Even though I agree there's a lot of apples and oranges in this rough test, it gives an indication of what's possible.
I was timing using
Measure-Command {Start-Process uncompress.bat -Wait}
from within Windows PowerShell.
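
For anyone wanting to repeat this sort of rough timing without the batch files, here is an in-process sketch in Python. It is not what was run for the numbers below: it only covers gzip and xz (via the standard library `gzip` and `lzma` modules, since LZ4, Snappy and Zhuff are not in the standard library), it works entirely in memory so disc time is excluded, and the sample data is made up.

```python
import gzip
import lzma
import time


def time_codec(compress, decompress, data, runs=3):
    """Return (best compress s, best uncompress s, compressed size / original size)."""
    best_c = best_d = float("inf")
    packed = b""
    for _ in range(runs):
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        unpacked = decompress(packed)
        t2 = time.perf_counter()
        assert unpacked == data  # the round trip must be lossless
        best_c = min(best_c, t1 - t0)
        best_d = min(best_d, t2 - t1)
    return best_c, best_d, len(packed) / len(data)


CODECS = {
    "gzip -1": (lambda d: gzip.compress(d, compresslevel=1), gzip.decompress),
    "gzip -9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
    "xz -0": (lambda d: lzma.compress(d, preset=0), lzma.decompress),
    "xz -6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}

# Made-up sample standing in for a delimited tick file.
sample = b"2014-09-03T09:00:00.000,EURUSD,1.3145,1.3147\n" * 20000

for name, (c, d) in CODECS.items():
    ct, dt, ratio = time_codec(c, d, sample)
    print(f"{name:8s} compress {ct:.3f}s  uncompress {dt:.3f}s  size {ratio:.1%}")
```

Because the sample is one line repeated, the ratios it prints will be far better than anything real price data gives; swap in bytes read from an actual file to get meaningful figures.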


The Snappy binary performed worse than the default LZ4.
The time to compress or uncompress with LZ4 was less than the difference between the time to get an uncompressed file on or off disc and the time to move a compressed file. I.e. a straight copy of a 6GB file was slower than compressing it and saving the compressed file.
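
A back-of-envelope model of why that can happen, sketched in Python. It assumes the compressor and the disc work one after the other (no overlap) and that disc speed is the same for raw and compressed data. The 100 MB/s disc and 400 MB/s compressor figures are illustrative assumptions, not measurements from this test; only the 23% ratio is taken from the LZ4 results below.

```python
def raw_write_time(size_mb, disk_mb_per_s):
    """Seconds to write the uncompressed file."""
    return size_mb / disk_mb_per_s


def compressed_write_time(size_mb, ratio, compress_mb_per_s, disk_mb_per_s):
    """Seconds to compress the file and then write the smaller output."""
    return size_mb / compress_mb_per_s + (size_mb * ratio) / disk_mb_per_s


def break_even_compress_speed(ratio, disk_mb_per_s):
    """Compressor throughput (MB/s) above which compressing first is quicker.

    From size/c + ratio*size/d < size/d, which rearranges to c > d / (1 - ratio).
    """
    return disk_mb_per_s / (1 - ratio)


size_mb = 6000      # roughly the 6GB file
ratio = 0.23        # lz4 -1 compressed size from the results table
disk = 100.0        # assumed disc throughput, MB/s
compressor = 400.0  # assumed compressor throughput, MB/s

print(raw_write_time(size_mb, disk))
print(compressed_write_time(size_mb, ratio, compressor, disk))
print(break_even_compress_speed(ratio, disk))
```

Under these assumptions, any compressor that sustains more than about 130 MB/s at a 23% ratio gets the file to disc faster than writing it uncompressed.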

-------testing results ---------
200MB file
Where several times are listed, they are from repeated runs.
program  type    compress time (s)  % of original  size (k)  uncompress time (s)
7zip     zip     38.3, 38.1         18%            36729     2.0, 1.6, 1.6, 1.8
7zip     7z      87.0               10%            20913     1.8, 1.9, 2.0
7zip     -9      158.6              8%             16567     1.7
7zip     -8      122.9              8%             16750     1.7
7zip     xz -9   155.15208          8%             16570     1.7
gzip             9.6, 9.7           21%            43687     1.8, 1.9
gzip     -1      4.1, 4.1           25%            51190     2.2, 2.0
gzip     -9      36.5, 28.5         20%            41852     1.9, 1.9
lz4      -1      0.3, 0.3           34%            70250     0.3, 0.3
lz4      -9      1.4, 1.4           25%            51779     0.3, 0.3
snappy   snappy  1.1, 1.4, 1.0      35%            72238     0.9, 1.0
zhuff    #0      0.2                17%            35161     0.2
zhuff    #1      0.6                20%            41752     0.2
zhuff    #2      1.1                18%            37043     0.3
xz       -0      13.6               20%            40919     5.2
xz       -1      13.4               15%            30478     3.9
xz       -2      20.2               14%            27800     3.3
xz       -3      34.1               13%            26392     3.4
xz       -4      52.3               13%            25845     3.5
xz       -5      96.3               10%            21234     2.6
xz       -6      188.3              9%             17449     2.4
xz       -7      209.9              8%             17022     2.7
xz       -8      235.3              8%             16754     2.5
xz       -9      260.3              8%             16569     2.4
xz       -e9     274.6              8%             16469     2.5
xz       -e6     192.6              8%             17356     2.4

1.2GB file
- = no value recorded
program  type    compress time (s)  % of original  size (k)  uncompress time (s)
7zip     zip     -                  -              -         -
7zip     7z      326.9              10%            119953    -
gzip             -                  -              -         -
gzip     -1      -                  -              -         -
gzip     -9      -                  -              -         -
lz4      -1      1.2                23%            278311    1.3
lz4      -9      6.2                19%            220668    1.3
snappy   snappy  7.8                24%            286688    2.6
zhuff    #0      0.8                16%            186285    0.8
zhuff    #1      4.7                14%            170605    1.0  (crashed once)
zhuff    #2      4.5                13%            153659    2.7

Straight copy, for comparison:
program          time (s)
copy1G2.bat      14.3
robocopy         18.9

5.978GB file
program  type    compress time (s)  % of original  size (k)  uncompress time (s)
lz4      -1      5.3                23%            1372662   5.6
lz4      -9      51.7               18%            1084319   5.8
snappy   snappy  18.219078          24%            1407327   16
gzip     -1      90.034402          17%            1006793   48.4
zhuff    #0      3.8                15%            912290    4.3
zhuff    #1      12.8               14%            836208    4.0
zhuff    #2      17.7               13%            752741    4.0