www.data-compression.info The Data Compression Resource on the Internet
Today the most popular schemes for lossless data compression are the Burrows-Wheeler Compression Algorithm (BWCA), Prediction by Partial Matching (PPM) and Lempel-Ziv (LZ) based compression schemes. The first two schemes are context based, whereas the LZ schemes are based on repetitions. Although each of these schemes can compress any kind of data, none of them takes into account the special properties of particular kinds of data, such as textual data, record based data or graphical data. The compression rate of such standard schemes can often be improved by preprocessing algorithms specialized for the respective kind of data. Preprocessing algorithms are reversible transformations: the forward transformation is applied before the actual compression scheme during encoding, and the inverse transformation is applied after decompression during decoding.
This paper from 2005 by Jürgen Abel and Bill Teahan presents several preprocessing algorithms for textual data, which work with BWT, PPM and LZ based compression schemes. The algorithms need no external dictionary and are language independent. The average compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.
Another preprocessing paper, from 2003 by Jürgen Abel, presents a preprocessing algorithm that exploits the structure of record based files. The compression rate of these files can be improved if the file is transposed byte-wise by the record length before compression and transposed back after decompression. The approach is able to detect files with such a structure and to determine the corresponding record length.
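To illustrate the idea of a byte-wise transposition by the record length, here is a minimal Python sketch. It is not taken from the paper: the function names `transpose_records` and `untranspose_records` are hypothetical, the record length is assumed to be known already (the detection step described in the paper is not shown), and a trailing partial record is simply copied unchanged.

```python
def transpose_records(data: bytes, record_len: int) -> bytes:
    """Group byte 0 of every record, then byte 1, and so on.

    Bytes after the last complete record are appended unchanged.
    """
    if record_len <= 0:
        raise ValueError("record_len must be positive")
    n_records = len(data) // record_len
    body_len = n_records * record_len
    body, tail = data[:body_len], data[body_len:]
    out = bytearray()
    for col in range(record_len):
        # Every record_len-th byte, starting at offset col.
        out += body[col::record_len]
    return bytes(out) + tail


def untranspose_records(data: bytes, record_len: int) -> bytes:
    """Inverse of transpose_records for the same record_len."""
    if record_len <= 0:
        raise ValueError("record_len must be positive")
    n_records = len(data) // record_len
    body_len = n_records * record_len
    body, tail = data[:body_len], data[body_len:]
    out = bytearray(body_len)
    for col in range(record_len):
        # Column col occupies n_records consecutive bytes in the transposed data.
        out[col::record_len] = body[col * n_records:(col + 1) * n_records]
    return bytes(out) + tail


records = b"A1xB2yC3z"                      # three records of length 3
transposed = transpose_records(records, 3)  # b"ABC123xyz"
assert untranspose_records(transposed, 3) == records
```

After the transposition, bytes from the same field position sit next to each other, which tends to give a subsequent compressor longer runs of similar symbols.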
This paper from 1999 by Szymon Grabowski is an interesting text preprocessing paper for BWCAs. It describes capital conversion, space stuffing, phrase substitution, alphabet reordering and EOL coding. A drawback is the dependence on the English language.
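As a rough illustration of what a capital conversion step can look like, the following Python sketch replaces a leading capital letter with a flag character followed by the lower-case letter. It is a simplified variant written for this page, not the scheme from the paper: the names `encode_capitals`, `decode_capitals` and `FLAG` are hypothetical, only ASCII letters are handled, words written entirely in capitals are left untouched, and the input is assumed not to contain the flag character.

```python
import re

FLAG = "\x01"  # assumed not to occur in the input text

# An ASCII capital at the start of a word that is not followed by another capital.
_CAPITAL = re.compile(r"(?<![A-Za-z])[A-Z](?![A-Z])")
_FLAGGED = re.compile(re.escape(FLAG) + r"([a-z])")


def encode_capitals(text: str) -> str:
    """Replace a leading capital X with FLAG + x."""
    if FLAG in text:
        raise ValueError("input already contains the flag character")
    return _CAPITAL.sub(lambda m: FLAG + m.group(0).lower(), text)


def decode_capitals(text: str) -> str:
    """Inverse of encode_capitals: FLAG + x becomes X again."""
    return _FLAGGED.sub(lambda m: m.group(1).upper(), text)


original = "Hello World, NASA rocks"
encoded = encode_capitals(original)
assert decode_capitals(encoded) == original
```

The intent of such a transformation is that "Hello" and "hello" share the same lower-case context after encoding, at the cost of one extra flag symbol per converted word.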
The paper from 2005 by Przemyslaw Skibinski, Szymon Grabowski and Sebastian Deorowicz describes several aspects of dictionary-based compression, including a compact dictionary representation, a word replacing transformation, a PPM/BWCA oriented scheme and an LZ77 oriented scheme. It uses a fixed external English dictionary.
Bill works at the School of Informatics in Bangor, University of Wales, United Kingdom. His research interests include text compression, information theory, compression-based language models, computational linguistics, information retrieval and text mining. Bill is quite active in orienteering and running (I almost couldn't keep up with him when we climbed some hills in Wales).