This page lists pointers to different corpora for data compression. A corpus is a fixed set of files used to evaluate the practical performance of different compression schemes. The compression rate is measured in bits per symbol (bps): the size of the output in bits divided by the size of the input in bytes. A value of 8 bps means no compression; smaller values mean better (stronger) compression.

Often the unweighted average compression rate over all files of the corpus is calculated and compared between algorithms. It is obtained by summing the compression rates of the individual files and dividing the sum by the number of files. Sometimes a weighted average is calculated instead: the total compressed size (in bits) divided by the total uncompressed size (in bytes) of all files of the corpus. In that case the bigger files have a stronger influence on the result, because they weigh more. Execution times can also be compared, as long as all algorithms run on the same computer.

There are different corpora for different data types. Some corpora contain smaller files, others larger ones. Some emphasize text files, others picture, video, sound or protein files.
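The compression rate and the two averages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the two in-memory "files" and the use of `zlib` as the compressor are stand-ins chosen for the example, not part of any of the corpora listed here.

```python
import zlib


def bits_per_symbol(original: bytes, compressed: bytes) -> float:
    """Compression rate in bps: output size in bits / input size in bytes."""
    if not original:
        raise ValueError("input file must not be empty")
    return 8 * len(compressed) / len(original)


def unweighted_average(pairs) -> float:
    """Mean of the per-file rates; every file counts equally."""
    rates = [bits_per_symbol(orig, comp) for orig, comp in pairs]
    return sum(rates) / len(rates)


def weighted_average(pairs) -> float:
    """Total compressed bits / total uncompressed bytes; big files weigh more."""
    total_in = sum(len(orig) for orig, _ in pairs)
    total_out = sum(len(comp) for _, comp in pairs)
    return 8 * total_out / total_in


# Hypothetical stand-in corpus: two in-memory "files" of very different size.
corpus = {
    "small.txt": b"abracadabra " * 50,
    "large.bin": bytes(range(256)) * 400,
}

# zlib is used here only as an example compressor (assumption).
pairs = [(data, zlib.compress(data, 9)) for data in corpus.values()]

for name, (orig, comp) in zip(corpus, pairs):
    print(f"{name}: {bits_per_symbol(orig, comp):.3f} bps")
print(f"unweighted average: {unweighted_average(pairs):.3f} bps")
print(f"weighted average:   {weighted_average(pairs):.3f} bps")
```

With files of very different sizes the two averages can differ noticeably, which is why a comparison should state which one it reports.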
Title | Authors | Year | Location
The Calgary Corpus | Ian Witten, Timothy Bell and John Cleary | 1987 | University of Calgary, Canada
The Canterbury Corpus | Ross Arnold and Timothy Bell | 1997 | University of Canterbury, New Zealand
Lukas Corpus | Jürgen Abel | 2006 | This site
The Protein Corpus | Craig Nevill-Manning and Ian Witten | 1999 | This site, paper from the IEEE Data Compression Conference 1999, Snowbird, Utah, United States of America
The Silesia Corpus | Sebastian Deorowicz | 2003 | Silesian University of Technology, Poland