www.data-compression.info The Data Compression Resource on the Internet
Data Compression Corpora
This page contains a list of pointers to different corpora for data compression. A corpus is a distinct set of files used for evaluating the practical performance of different compression schemes. The compression rate is measured in bits per symbol (bps), as the quotient of the size of the output in bits to the size of the input in bytes. A value of 8 bps means no compression; smaller values represent better (stronger) compression.

Often the unweighted average compression rate over all files of the corpus is calculated and compared between different algorithms. The unweighted average is calculated by summing the compression rates of the individual files and dividing the sum by the number of files. Sometimes a weighted average is calculated instead, which is the total compressed size in bits divided by the total uncompressed size in bytes of all files of the corpus. In the latter case the larger files have a stronger influence on the result: they weigh more. It is also possible to compare the execution times of the algorithms, as long as all algorithms run on the same computer.

There are different corpora for different data types. Some corpora contain smaller files, some larger files. Some corpora put the emphasis on text files, others on picture, video, sound or protein files.
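The difference between the two averages can be illustrated with a minimal Python sketch. The function names and the file sizes below are hypothetical and chosen only for illustration; they do not come from any of the corpora listed on this page.

```python
def bits_per_symbol(original_bytes, compressed_bytes):
    """Compression rate: output size in bits per input byte (8.0 = no compression)."""
    return 8.0 * compressed_bytes / original_bytes


def unweighted_average(sizes):
    """Mean of the per-file rates; every file counts equally."""
    rates = [bits_per_symbol(orig, comp) for orig, comp in sizes]
    return sum(rates) / len(rates)


def weighted_average(sizes):
    """Total compressed bits over total input bytes; larger files weigh more."""
    total_original = sum(orig for orig, _ in sizes)
    total_compressed = sum(comp for _, comp in sizes)
    return bits_per_symbol(total_original, total_compressed)


# Hypothetical corpus: (uncompressed bytes, compressed bytes) per file.
sizes = [
    (1_000, 250),       # small file, 2.0 bps
    (100_000, 50_000),  # large file, 4.0 bps
]

print(f"unweighted: {unweighted_average(sizes):.2f} bps")  # 3.00 bps
print(f"weighted:   {weighted_average(sizes):.2f} bps")    # 3.98 bps
```

In this made-up example the large file compresses worse than the small one, so the weighted average (about 3.98 bps) lies much closer to the large file's rate than the unweighted average (3.00 bps) does.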