The Calgary Corpus is the most referenced corpus in the data compression field exspecially for text compression and is the de facto standard for lossless compression evaluation. The corpus was founded in 1987 by Ian Witten, Timothy Bell and John Cleary for their research paper "MODELING FOR TEXT COMPRESSION" at the University of Calgary, Canada. The research paper was published in 1989 at ACM Computing Surveys. In 1990 the corpus was used in their book "Text compression". The corpus consists of 18 files (Large Calgary Corpus: bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans), but only 14 files were used in the paper and book (Standard Calgary Corpus: all files except paper3, paper4, paper5 and paper6). The corpus is available below and at ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus.
The contexts of the readme-file of the Calgary Corpus is listed below:
Welcome to the Calgary/Canterbury text compression corpus. This corpus is used in the book
Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression. Prentice Hall, Englewood Cliffs, NJ, 1990
and in the survey paper
Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text compression," Computing Surveys 21(4): 557-591; December 1989,
to evaluate the practical performance of various text compression schemes. Several other researchers are now using the corpus to evaluate text compression schemes. Nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. Normal English, both fiction and non-fiction, is represented by two books and papers (labeled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of English writing are found in a bibliography (bib) and a batch of unedited news articles (news). Three computer programs represent artificial languages (progc, progl, progp). A transcript of a terminal session (trans) is included to indicate the increase in speed that could be achieved by applying compression to a slow line to a terminal. All of the files mentioned so far use ASCII encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2), some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros. More details of the individual texts are given in the book mentioned above. Both book and paper give the results of compression experiments on these texts. The corpus itself constitutes files bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans. (The book and paper above do not give results for files paper3, paper4, paper5 or paper6.) The directory "index" contains the sizes of the files and some information about where they came from.
Although from 1988 this paper from Timothy Bell, Ian Witten and John Cleary is one of my favourites. It is easy to read, well structured and explains all important details. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling which are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
Authors: Timothy Bell, John Cleary and Ian Witten Publisher: Prentice-Hall, Englewood, United States of America, 1990 ISBN: 0-13-911991-4 Size: 318 pages Price: 57.00 USD
This book is the reference on lossless compression. The emphasis is set on text compression and language modeling. It contains several statistical studies on text compression and explains in detail the adaptive modeling and the different PPM schemes (A, B, C). It belongs to my favourite books in the data compression world.