Calgary Corpus

www.data-compression.info
The Data Compression Resource on the Internet

The University of Calgary

The Calgary Corpus is the most referenced corpus in the data compression field, especially for text compression, and is the de facto standard for lossless compression evaluation.
The corpus was created in 1987 by Ian Witten, Timothy Bell and John Cleary at the University of Calgary, Canada, for their research paper "MODELING FOR TEXT COMPRESSION". The paper was published in 1989 in ACM Computing Surveys. In 1990 the corpus was used in their book "Text Compression".
The corpus consists of 18 files (Large Calgary Corpus: bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans), but only 14 files were used in the paper and book (Standard Calgary Corpus: all files except paper3, paper4, paper5 and paper6).
The corpus is available below and at ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus.
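
The two file sets described above can be written down directly, together with a small helper for measuring compression in bits per byte. This is only a sketch: the use of zlib as the compressor, the function names `bits_per_byte` and `evaluate`, and the directory layout (one file per corpus name in a single folder) are assumptions made for illustration, not part of the corpus distribution.

```python
import zlib
from pathlib import Path

# The 18 files of the Large Calgary Corpus, as listed above.
LARGE_CALGARY = [
    "bib", "book1", "book2", "geo", "news", "obj1", "obj2",
    "paper1", "paper2", "paper3", "paper4", "paper5", "paper6",
    "pic", "progc", "progl", "progp", "trans",
]

# The Standard Calgary Corpus omits these four papers, leaving 14 files.
EXTRA_PAPERS = {"paper3", "paper4", "paper5", "paper6"}
STANDARD_CALGARY = [name for name in LARGE_CALGARY if name not in EXTRA_PAPERS]


def bits_per_byte(data: bytes, level: int = 9) -> float:
    """Compressed size in bits divided by the original size in bytes."""
    if not data:
        raise ValueError("cannot measure an empty input")
    return 8 * len(zlib.compress(data, level)) / len(data)


def evaluate(corpus_dir: str, names=STANDARD_CALGARY) -> dict:
    """Return {file name: bits per byte} for every corpus file found.

    corpus_dir is a hypothetical local folder holding the unpacked corpus;
    files that are missing are skipped silently.
    """
    results = {}
    for name in names:
        path = Path(corpus_dir) / name
        if path.is_file():
            results[name] = bits_per_byte(path.read_bytes())
    return results
```

Calling `evaluate("calgary")` on a folder containing the unpacked files would give one figure per file; results are usually reported per file rather than as a single total because the files differ so much in type and size.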

The contents of the readme file of the Calgary Corpus are listed below:

Welcome to the Calgary/Canterbury text compression corpus. This corpus is used in the book

Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression.
Prentice Hall, Englewood Cliffs, NJ, 1990

and in the survey paper

Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text
compression," Computing Surveys 21(4): 557-591; December 1989,

to evaluate the practical performance of various text compression schemes. Several other researchers are now using the corpus to evaluate text compression schemes.
Nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. Normal English, both fiction and non-fiction, is represented by two books and papers (labeled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of English writing are found in a bibliography (bib) and a batch of unedited news articles (news). Three computer programs represent artificial languages (progc, progl, progp). A transcript of a terminal session (trans) is included to indicate the increase in speed that could be achieved by applying compression to a slow line to a terminal. All of the files mentioned so far use ASCII encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2), some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros.
More details of the individual texts are given in the book mentioned above. Both book and paper give the results of compression experiments on these texts.
The corpus itself constitutes files bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans. (The book and paper above do not give results for files paper3, paper4, paper5 or paper6.)
The directory "index" contains the sizes of the files and some information about where they came from.
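
The readme's remark that geo is hard to compress while pic is highly compressible can be illustrated with a rough order-0 entropy estimate. The sketch below does not use the real files: `wide_range` and `mostly_zero` are synthetic stand-ins invented here, and order-0 entropy only reflects how skewed the byte frequencies are, not the run structure that a real compressor would also exploit in pic.

```python
import math
import random
from collections import Counter


def order0_entropy(data: bytes) -> float:
    """Empirical entropy in bits per byte, ignoring all context."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)  # fixed seed so the stand-in data is reproducible

# Stand-in for data with a wide range of values: every byte value equally likely.
wide_range = bytes(rng.randrange(256) for _ in range(20000))

# Stand-in for a mostly white bitmap: about 95% zero bytes.
mostly_zero = bytes(0 if rng.random() < 0.95 else 255 for _ in range(20000))

print(f"wide range : {order0_entropy(wide_range):.2f} bits/byte")
print(f"mostly zero: {order0_entropy(mostly_zero):.2f} bits/byte")
```

The first figure comes out close to the 8 bits per byte of incompressible data, the second well under 1 bit per byte.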

Publications

MODELING FOR TEXT COMPRESSION

Although it dates from 1988, this paper by Timothy Bell, Ian Witten and John Cleary is one of my favourites. It is easy to read, well structured and explains all the important details.
Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling which are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
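
The idea of adaptive finite-context modeling summarised above can be sketched in a few lines. This is not one of the schemes from the paper: it uses a fixed context order with simple add-one smoothing instead of an escape mechanism such as PPM's, and the function name `adaptive_code_length` and its parameters are assumptions chosen for illustration. It reports the ideal code length an arithmetic coder driven by the model would approach.

```python
import math
from collections import defaultdict


def adaptive_code_length(data: bytes, order: int = 2, alphabet_size: int = 256) -> float:
    """Ideal code length in bits under an adaptive order-`order` context model.

    Each symbol is predicted from the previous `order` bytes using only the
    counts gathered from the text seen so far, then the counts are updated.
    Add-one smoothing gives unseen symbols a non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]  # shorter at the start of the text
        p = (counts[context][symbol] + 1) / (totals[context] + alphabet_size)
        bits -= math.log2(p)
        # Update after coding, so a decoder could maintain the same model.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits


sample = b"abracadabra " * 500
for k in (0, 1, 2):
    print(k, adaptive_code_length(sample, order=k) / len(sample))
```

On this repetitive sample the bits per character fall as the order grows, because longer contexts make the next character more predictable. On short or varied texts a high order can do worse, since each context is seen too rarely to give reliable counts.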

Modeling for text compression

The same paper as published in ACM Computing Surveys

Text Compression

Authors: Timothy Bell, John Cleary and Ian Witten
Publisher: Prentice-Hall, Englewood Cliffs, NJ, United States of America, 1990
ISBN: 0-13-911991-4
Size: 318 pages
Price: 57.00 USD

This book is the reference on lossless compression. The emphasis is on text compression and language modeling. It contains several statistical studies on text compression and explains adaptive modeling and the different PPM schemes (A, B, C) in detail.
It is one of my favourite books in the data compression world.

People

Timothy Bell

Timothy Bell works at the University of Canterbury, New Zealand, and is the "father" of the Canterbury Corpus. His research interests include compression, computer science for children, and music.

John Cleary

John Cleary works at the University of Waikato, New Zealand, and has published several well-known papers together with Ian Witten and Timothy Bell.

Ian Witten

Ian Witten works at the University of Waikato, New Zealand. Together with John Cleary and Timothy Bell he published "Modeling for Text Compression".

Source Code

The Large Calgary Corpus

ZIP-file with: bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans

The Standard Calgary Corpus

ZIP-file with: bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, pic, progc, progl, progp and trans

Copyright © 2002-2022 Dr.-Ing. Jürgen Abel, Lechstraße 1, 41469 Neuß, Germany. All rights reserved.