www.data-compression.info
The Data Compression Resource on the Internet

 Data Compression Corpora


This page contains a list of pointers to different corpora for data compression.
A corpus is a distinct set of files, used for evaluating the practical performance of different compression schemes.
The compression rate is measured in bits per symbol (bps), calculated as the size of the output in bits divided by the size of the input in bytes. A value of 8 bps means no compression; smaller values represent better (stronger) compression.
Often the unweighted average compression rate over all files of the corpus is calculated and compared between different algorithms. It is obtained by summing the compression rates of the individual files and dividing the sum by the number of files.
Sometimes a weighted average is calculated instead, which is the total compressed size in bits divided by the total uncompressed size in bytes of all files of the corpus. In the latter case the bigger files have a stronger influence on the result: they weigh more.
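The two averages can differ noticeably when file sizes vary. The following Python sketch illustrates the calculations described above; the function names and the two file sizes are hypothetical and chosen only for illustration, they are not taken from any of the corpora.

```python
def bits_per_symbol(compressed_bytes, original_bytes):
    """Compression rate in bps: output size in bits / input size in bytes."""
    return 8 * compressed_bytes / original_bytes


def unweighted_average(sizes):
    """Mean of the per-file rates; every file counts equally."""
    rates = [bits_per_symbol(c, o) for o, c in sizes]
    return sum(rates) / len(rates)


def weighted_average(sizes):
    """Total compressed size over total original size; big files dominate."""
    total_original = sum(o for o, _ in sizes)
    total_compressed = sum(c for _, c in sizes)
    return bits_per_symbol(total_compressed, total_original)


# Hypothetical (original, compressed) sizes in bytes, not real corpus data.
sizes = [(1000, 250), (100000, 50000)]

print(f"unweighted: {unweighted_average(sizes):.2f} bps")  # 3.00 bps
print(f"weighted:   {weighted_average(sizes):.2f} bps")    # 3.98 bps
```

The small file compresses to 2 bps and the large one to 4 bps. The unweighted average treats both equally (3 bps), while the weighted average is pulled towards the rate of the large file.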
It is also possible to compare the execution times of the algorithms, provided that all algorithms are run on the same computer.
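As a rough sketch of such a timing comparison, the following Python example runs three compressors from the Python standard library on the same input and on the same machine. The helper name `time_compressor`, the repeat count and the stand-in test data are assumptions made for illustration; in practice each file of a corpus would be read in binary mode and measured in the same way.

```python
import bz2
import lzma
import time
import zlib


def time_compressor(compress, data, repeats=3):
    """Return (best wall-clock seconds, compressed size in bytes)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        out = compress(data)
        best = min(best, time.perf_counter() - start)
    return best, len(out)


# Stand-in data (an assumption); replace with the bytes of a corpus file.
data = b"the quick brown fox jumps over the lazy dog\n" * 2000

compressors = [
    ("zlib", zlib.compress),
    ("bz2", bz2.compress),
    ("lzma", lzma.compress),
]
results = {name: time_compressor(fn, data) for name, fn in compressors}

for name, (seconds, size) in results.items():
    print(f"{name}: {seconds:.4f} s, {8 * size / len(data):.3f} bps")
```

Taking the best of several runs reduces the influence of other processes on the measurement. The absolute times are only meaningful relative to each other on the same machine.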
There are different corpora for different data types. Some corpora contain smaller files, some larger files. Some corpora put the emphasis on text files, others on picture, video, sound or protein files.

 List of Corpora


The Calgary Corpus
Authors: Ian Witten, Timothy Bell and John Cleary
Year: 1987
Location: University of Calgary, Canada.

The Canterbury Corpus
Authors: Ross Arnold and Timothy Bell
Year: 1997
Location: University of Canterbury, New Zealand.

The Lukas Corpus
Author: Jürgen Abel
Year: 2006
Location: This site.

The Protein Corpus
Authors: Craig Nevill-Manning and Ian Witten
Year: 1999
Location: This site, paper from the IEEE Data Compression Conference 1999, Snowbird, Utah, United States of America.

The Silesia Corpus
Author: Sebastian Deorowicz
Year: 2003
Location: Silesian University of Technology, Poland.

Copyright © 2002-2022 Dr.-Ing. Jürgen Abel, Lechstraße 1, 41469 Neuß, Germany. All rights reserved.