Preprocessing

www.data-compression.info
The Data Compression Resource on the Internet

Today the most popular schemes for lossless data compression are the Burrows-Wheeler Compression Algorithm (BWCA), Prediction by Partial Matching (PPM) and Lempel-Ziv (LZ) based compression schemes. The first two schemes are context related, whereas the LZ scheme is based on repetitions. Even though each of these schemes can be used to compress any kind of data, they do not consider the special properties of different kinds of data, like textual data, record based data or graphical data.
The compression rate of such standard schemes can often be improved by preprocessing algorithms specialized for the respective kind of data.
Preprocessing algorithms are reversible transformations: during encoding, the transformation is applied before the actual compression scheme, and during decoding, its inverse is applied after decompression.
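
The order of operations can be illustrated with a minimal sketch. It assumes Python's zlib module (an LZ77-based DEFLATE compressor) as a stand-in for the actual compression scheme and uses byte-wise delta coding as a toy reversible transformation. The names encode, decode, delta_forward and delta_inverse are hypothetical and chosen only for this illustration; none of them comes from the papers listed below.

```python
import zlib
from typing import Callable

Transform = Callable[[bytes], bytes]


def encode(data: bytes, forward: Transform) -> bytes:
    """Encoding: apply the reversible transformation, then compress."""
    return zlib.compress(forward(data))


def decode(blob: bytes, inverse: Transform) -> bytes:
    """Decoding: decompress, then undo the transformation."""
    return inverse(zlib.decompress(blob))


# Toy transformation (an assumption for illustration): store each byte
# as the difference to its predecessor, modulo 256.
def delta_forward(data: bytes) -> bytes:
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_inverse(data: bytes) -> bytes:
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


if __name__ == "__main__":
    sample = bytes(range(256)) * 4  # steadily rising byte values
    packed = encode(sample, delta_forward)
    assert decode(packed, delta_inverse) == sample
    print(len(zlib.compress(sample)), len(packed))
```

Whether a given transformation helps depends entirely on the data. Delta coding suits slowly changing values such as the sample above and would typically not help on text.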

Publications

Universal Text Preprocessing for Data Compression

This paper from 2005 by Jürgen Abel and Bill Teahan presents several preprocessing algorithms for textual data, which work with BWT, PPM and LZ based compression schemes. The algorithms need no external dictionary and are language independent. The average compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.

Record Preprocessing for Data Compression

Another preprocessing paper from 2003 by Jürgen Abel. The paper presents a preprocessing algorithm that exploits the structure of record-based files. The compression rate of these files can be improved if the file is transposed byte-wise by the record length before compression and transposed back after decompression. The approach is able to detect files with such a structure and to determine the corresponding record length.
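
The byte-wise transposition can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: the record length is passed in by hand instead of being detected, and the file length is assumed to be an exact multiple of the record length. The function names transpose and untranspose are hypothetical.

```python
def transpose(data: bytes, record_len: int) -> bytes:
    """Group byte 0 of every record, then byte 1 of every record, and so on."""
    if record_len <= 0 or len(data) % record_len != 0:
        raise ValueError("length must be a multiple of a positive record length")
    return b"".join(data[i::record_len] for i in range(record_len))


def untranspose(data: bytes, record_len: int) -> bytes:
    """Inverse of transpose(): rebuild the records from the grouped columns."""
    if record_len <= 0 or len(data) % record_len != 0:
        raise ValueError("length must be a multiple of a positive record length")
    n_records = len(data) // record_len
    # Each column holds n_records bytes, so record k consists of every
    # n_records-th byte starting at offset k.
    return b"".join(data[k::n_records] for k in range(n_records))


if __name__ == "__main__":
    records = b"A1B2C3"  # three records of two bytes each
    columns = transpose(records, 2)
    assert columns == b"ABC123"
    assert untranspose(columns, 2) == records
```

After the transposition, bytes that share the same position within a record sit next to each other, which tends to give the following compression scheme more homogeneous contexts.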

Text Preprocessing for Burrows-Wheeler Block-Sorting Compression

This paper from 1999 by Szymon Grabowski is an interesting text preprocessing paper for BWCAs. It describes capital conversion, space stuffing, phrase substitution, alphabet reordering and EOL coding. A drawback is the dependence on the English language.

Revisiting dictionary-based compression

The paper from 2005 by Przemyslaw Skibinski, Szymon Grabowski and Sebastian Deorowicz describes several aspects of dictionary-based compression, including a compact dictionary representation, a word replacing transformation, a PPM/BWCA oriented scheme and an LZ77 oriented scheme. It uses a fixed external English dictionary.

People

Jürgen Abel

Jürgen Abel is the author of this site and of the compression program ABC. His research interests are data compression, data encryption, information retrieval and 3-D graphics.

Sebastian Deorowicz

Sebastian is the author of the Weighted Frequency Count algorithm (WFC) and an Assistant Professor at the Silesian University of Technology, Poland.

Szymon Grabowski

Szymon Grabowski is working at the Technical University of Lodz, Poland. His research interests include pattern recognition (Ph.D. in 2003), text indexing and data compression.

Przemyslaw Skibinski

Przemyslaw Skibinski is working at Wroclaw University, Poland, and is a member of the Computational Complexity and Algorithms Group.

William Teahan

Bill is working at the School of Informatics in Bangor, University of Wales, United Kingdom. His research interests include text compression, information theory, compression-based language models, computational linguistics, information retrieval and text mining.
Bill is quite active in orienteering and running (I almost couldn't keep up with him when we climbed some hills in Wales).

Source Code

WRT 3.0

WRT (Word Replacing Transformation) is a preprocessing algorithm by Przemyslaw Skibinski, which is based on StarNT and replaces words with numbers that refer to entries of a fixed external dictionary.
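
The general idea of replacing words with dictionary numbers can be sketched in a few lines. This is not WRT's actual codeword format or dictionary: the four-word DICTIONARY, the ESCAPE character, the BASE offset and the function names are all assumptions made for this illustration, and the sketch requires that the input text does not contain the escape character.

```python
import re

# Hypothetical fixed dictionary shared by encoder and decoder.
DICTIONARY = ["the", "and", "data", "compression"]
INDEX = {word: i for i, word in enumerate(DICTIONARY)}

ESCAPE = "\x01"  # assumed never to occur in the input text
BASE = ord("A")  # entry i is written as ESCAPE + chr(BASE + i)


def encode_words(text: str) -> str:
    """Replace every dictionary word with a two-character code."""
    if ESCAPE in text:
        raise ValueError("input must not contain the escape character")

    def repl(match: re.Match) -> str:
        word = match.group(0)
        i = INDEX.get(word)
        return word if i is None else ESCAPE + chr(BASE + i)

    return re.sub(r"[A-Za-z]+", repl, text)


def decode_words(coded: str) -> str:
    """Expand every code back into its dictionary word."""
    return re.sub(
        re.escape(ESCAPE) + "(.)",
        lambda m: DICTIONARY[ord(m.group(1)) - BASE],
        coded,
        flags=re.DOTALL,
    )


if __name__ == "__main__":
    text = "the data and the compression of data"
    coded = encode_words(text)
    assert decode_words(coded) == text
    print(len(text), len(coded))
```

Words that are not in the dictionary, including capitalized variants such as "The", pass through unchanged in this sketch.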

Copyright © 2002-2022 Dr.-Ing. Jürgen Abel, Lechstraße 1, 41469 Neuß, Germany. All rights reserved.