The Canterbury Corpus was developed by Ross Arnold and Timothy Bell in 1997 at the University of Canterbury, New Zealand, as an improved version of the Calgary Corpus. The files were chosen because the results they produce with existing compression algorithms are typical. The corpus was published at DCC 97 in the paper "A corpus for the evaluation of lossless compression algorithms". The final files were selected from a set of more than 800 candidate files. The DCC 97 paper explains how the files were chosen and why it is difficult to find "typical" files. There are two main editions of the Canterbury Corpus: the Standard Canterbury Corpus, consisting of 11 files (alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1), and the Large Canterbury Corpus, consisting of 3 files (bible.txt, E.coli, world192.txt).
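As a rough illustration of how a corpus like this is used, the following minimal sketch measures compressed size and compression time for the contents of one file. It is not the procedure used by the corpus authors or the corpus website: the function name benchmark, the choice of Python standard-library codecs (zlib, bz2, lzma) and the use of bits per input byte as the measure are all assumptions made for this example.

```python
import bz2
import lzma
import sys
import time
import zlib

# Standard-library codecs used as stand-ins (an assumption for this sketch);
# any function mapping bytes to compressed bytes could be added here.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def benchmark(data: bytes) -> dict:
    """Return {codec name: (bits per input byte, seconds)} for one input."""
    if not data:
        raise ValueError("cannot benchmark empty input")
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        # Compressed size in bits divided by original size in bytes.
        results[name] = (8 * len(compressed) / len(data), elapsed)
    return results


if __name__ == "__main__":
    # Paths to locally downloaded corpus files are given on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            contents = f.read()
        for name, (bits, secs) in benchmark(contents).items():
            print(f"{path}\t{name}\t{bits:.3f} bits/byte\t{secs:.4f} s")
```

Saved under a hypothetical name such as bench.py, it could be run as "python bench.py alice29.txt kennedy.xls" once the corpus files have been downloaded. Timings from a single run are noisy, so repeated runs would be needed for any serious comparison.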
The paper by Ross Arnold and Timothy Bell that introduced the Canterbury Corpus, published at DCC 97. It explains how the files were chosen and why it is difficult to find "typical" files.
The website of the Canterbury Corpus, maintained by Matt Powell. The site provides extensive information about the corpus: its editions and purpose, together with summaries and details of compression rates and times for a variety of compression algorithms.
In his 2001 paper, Matt Powell describes the work of maintaining the Canterbury Corpus website, in particular the process of automating results generation. The paper investigates the popularity and usefulness of the Canterbury Corpus as a data compression standard and proposes several areas for further research and development of the current system.
Matt Powell is studying Computer Science at the University of Canterbury, New Zealand. He likes academic life and drawing cartoons. As the secretary of the University Comedy Club he does all sorts of comedy, including skits, improv, stand-up and songs.