www.data-compression.info
The Data Compression Resource on the Internet

Contents

 Burrows-Wheeler Transformation / Block Sorting (BWT)


A very promising recent development in the field of lossless data compression is the Burrows-Wheeler Compression Algorithm (BWCA), introduced in 1994 by Michael Burrows and David Wheeler. The algorithm received considerable attention since of its Lempel-Ziv like execution speed and its compression performance close to state-of-the-art PPM algorithms.
It is based on a permutation of the input sequence - the Burrows-Wheeler Transformation (BWT), also called Block Sorting -, which groups symbols with a similar context close together. In the original version, this permutation was followed by a move to front (MTF) transformation and a final entropy coding (EC) stage. Later versions used different algorithms which come after the Burrows-Wheeler transform, since the stages after the Burrows-Wheeler transform have a significant influence on the compression rate too.
Within many approaches the MTF stage is replaced by a different stage in order to achieve a better ranking. Since the task of the MTF stage or its replacement is to transfrom the local structure of the BWT output into a global structure, this stage is called a Global Structure Transformation (GST). Representatives of GSTs are MTF, WFC, AWAF, IF, SIF and DC.
A Run Length Encoding (RLE) stage, which exist in many variations, is also quite common, most the time in front of the EC Stage.

A basic BWCA scheme
A basic BWCA scheme

If you are a newcommer in the BWT field and you are looking for some basic and important papers, here are some recommendations:
 -
A Block-Sorting Lossless Data Compression Algorithm: it all started with this paper from Michael Burrows and David Wheeler,
 -
Data Compression with the Burrows-Wheeler Transform: Mark’s basic BWCA article,
 -
Block Sorting Text Compression - Final Report: Peter’s technical BWCA report with many useful information like the RLE0 scheme from Wheeler,
 -
Second step algorithms in the Burrows-Wheeler compression algorithm: Sebastian describes the most important post BWT stages and his WFC stage,
 -
Modifications of the Burrows and Wheeler Data Compression Algorithm: an BWCA paper from Bernhard, which proposes many properties of the BWT output sequence
 -
One attempt of a compression algorithm using the BWT: another BWCA paper from Bernhard, which introduces the ternary sequence and it’s coding,
 -
Improving Suffix-Array Construction Algorithms with Applications: the Masters thesis of Tsai-Hsing Kao compares many suffix sorting algorithms for BWT and introduces a new and very fast method based on the two stage suffix sort by Hideo Itoh and Hozumi Tanaka,
 -
Unifying Text Search And Compression -Suffix Sorting, Block Sorting and Suffix Arrays: the doctoral thesis of Kunihiko includes the best and most extensive explanation of his famous suffix sorting algorithm beside many other descriptions of suffix sorting algorithms for BWT,
 -
Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages: a BWCA paper from my person, which gives a short introduction into the BWCA field and proposes several improvements for post BWT stages. The hybrid BWCA approach of this paper achieves a compression rate of 2.238 bps, which is the best result for a pure BWCA without any preprocessing before the BWT to date.

The hybrid BWCA scheme
The hybrid BWCA scheme from the paper
Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages

An algorithm for constructing suffix arrays is one possibility to calculate the BWT output sequence very fast and efficiently. In general, one could also use bubble sort (well, that would actually realy spoil the speed), quicksort or radix sort in order to calculate the BWT output sequence, but they are much slower. Some researchers use a suffix tree algorithm for this task, but it needs a lot of pointers (some like it hot) and usually a suffix array algorithm is faster, even though suffix tree algorithms run in linear time and most suffix array algorithms don't.
More information about sorting, string sorting, suffix trees and suffix arrays can be found at
String Sorting.

The BWCA is also a very promising candidate in the field of bio-informatics, as it can be used to compress protein files and for indexing and searching inside protein and DNA files.

 Publications


Logo

Title

Description

Grundlagen des Burrows-Wheeler-Kompressionsalgorithmus

Grundlagen des Burrows-Wheeler-Kompressionsalgorithm us
 

A German article from 2003 by Jürgen Abel about the principles of the Burrows-Wheeler Compression Algorithm, written for the journal "Informatik - Forschung und Entwicklung", Springer-Verlag Heidelberg, Germany, Association for Informatics (Gesellschaft für Informatik eV.). The article describes a BWCA implementation with a RLE and WFC stage.
The BWCA approach PIK03 achieves a compression rate of 2.25 bps for the Calgary Corpus.
 

Verlustlose Datenkompression auf Grundlage der Burrows-Wheeler-Transformation

Verlustlose Datenkompression auf Grundlage der Burrows-Wheeler-Transformation
 

Another German article from 2003 by Jürgen Abel about the basics of the Burrows-Wheeler Compression Algorithm, written for the journal "PIK - Praxis der Informationsverarbeitung und Kommunikation", Saur Verlag, Germany, Association for Informatics (Gesellschaft für Informatik eV.). The paper describes a simple but fast BWCA implementaion PIK03 with a MTF, RLE0 and EC stage, which has the same compression speed as GZIP and BZIP2.
The BWCA approach PIK03 achieves a compression rate of 2.35 bps for the Calgary Corpus.
 

DNA sequence compression using the Burrows-Wheeler Transform

Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages
 

A BWCA paper from 2003 by Jürgen Abel, which gives a short introduction into the BWCA field and proposes several improvements for the WFC stage and the IF stage and introduces a new RLE scheme for bypassing the run length information around the WFC stage. This paper is the basis of the program ABC.
This BWCA approach achieves a compression rate of 2.24 bps for the Calgary Corpus.
 

Universal Text Preprocessing for Data Compression

Universal Text Preprocessing for Data Compression
 

This paper from 2003 by Jürgen Abel and Bill Teahan presents several preprocessing algorithms for textual data, which work with BWT, PPM and LZ based compression schemes. The algorithms need no external dictionary and are language independent. The average compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 to 9 percent for the text files of the large Canterbury Corpus.
 

Record Preprocessing for Data Compression

Record Preprocessing for Data Compression
 

Another preprocessing paper from 2003 by Jürgen Abel. The paper reveals a preprocessing algorithm which exploits the structure of record based files. The compression rate of these files can be enhanced if the file is byte-wise transposed before compression and after decompression by the record length. The approach is able to detect files with such a structure and to determine the corresponding record length.
 

DNA sequence compression using the Burrows-Wheeler Transform

DNA sequence compression using the Burrows-Wheeler Transform
 

Don Adjeroh, Yong Zhang, Amar Mukherjee, Matt Powell and Tim Bell present in 2002 a proposal for compressing DNA sequences by using an off-line dictionary oriented approache in combination with a BWCA.
 

Text Compression with the Burrows-Wheeler Transform

Text Compression with the Burrows-Wheeler Transform
 

James Allen gives a introduction of the BWT and presents a simple BWCA implementation with source code in C on his internet site from 2002. He uses two coding stages I haven't seen anywhere else in the BWCA field: Zero Running (ZR) after the MTF stage and Elias's gamma coding as the EC stage.
 

Lexical Permutation Sorting Algorithm

Lexical Permutation Sorting Algorithm
 

This article by Ziya Arnavut and Spyros Magliveras from 1996 describes a "Lexical Permutation Sorting Algorithm" (LPSA), which is a generalization of the Burrows-Wheeler Compression Algorithm (BWCA).
 

Block Sorting and Compression

Block Sorting and Compression
 

This article by Ziya Arnavut and Spyros Magliveras from 1997 describes a "Lexical Permutation Sorting Algorithm" (LPSA), which is a generalization of the Burrows-Wheeler Compression Algorithm (BWCA), and a new post-BWT stage, called Inversion Frequencies (IF).
 

Move-to-Front and Inversion Coding

Move-to-Front and Inversion Coding
 

This conference proceeding paper from Ziya Arnavut at DCC 2000 compares the MTF stage with the IF stage. The IF stage, called inversion coding (ranking) technique, yields better compression gain than the recency ranking (MTF coder) for almost all the test data files.
 

Investigation of Block-Sorting of Multiset Permutations

Investigation of Block-Sorting of Multiset Permutations
 

Ziya and Meral Arnavut investigate in this 2004 article special cases of permutations, called multiset permutations, together with a transform different to the BWT, called Linear Ordering Transformation (LOT). LOT sorts rows of a cyclic rotated matrix with respect to the first element of each row.
 

Inversion Coder

Inversion Coder
 

A paper from 2004 by Ziya Arnavut presents and compares different inversion ranking schemes (IF) with recency ranking schemes (MTF). For large files, image and DNA files, IF achieves better compression rate than MTF schemes.
 

LIPT: A reversible lossless text transform to improve compression performance

LIPT: A reversible lossless text transform to improve compression performance
 

In this paper from 2001  by Fauzia Awan, Nan Zhang, Nitin Motgi, Raja Iqbal and Amar Mukherjee a text pre-processing algorithm is described for PPM and BWCA, which leads to a better compression rate. The gain for BWCA is around 5% on the Calgary Corpus. The algorithm replaces words by sequneces of "*" and letters.
By using an extern dictionary it is dependent on a specific language.
 

A new text preprocessing algorithm for bzip2 and PPM

A new text preprocessing algorithm for bzip2 and PPM
 

This is the same article from 2001 than "LIPT: A reversible lossless text transform to improve compression performance" by Fauzia Awan, Nan Zhang, Nitin Motgi, Raja Iqbal and Amar Mukherjee about text pre-processing for PPM and BWCA.
 

Selective Application of Burrows-wheeler Transformation for Enhancement of JPEG Entropy Coding

Selective Application of Burrows-wheeler Transformation for Enhancement of JPEG Entropy Coding
 

Hyun-Kee Baik, Dong Ha, Hyun-Gyoo Yook, Sung-Chul Shin and Myong-Soon Park describe in their paper from 1999 a variant of a BWCA in order to improve jpg-image compression.
 

A New Method to Improve the Performance of JPEG Entropy Coding Using Burrows-Wheeler Transformation

A New Method to Improve the Performance of JPEG Entropy Coding Using Burrows-Wheeler Transformation
 

Another paper by Hyun-Kee Baik, Hyun-Gyoo Yook, Sung-Chul Shin, Myong-Soon Park and Dong Sam Ha from 1999 about using BWCA for jpg-image compression.
 

Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice

Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice

 

Bernhard Balkenhol and Stefan Kurtz describe in their article from 1998 a BWCA which uses a BWT, MTF, RLE0 and EC stage.
This BWCA approach achieves a compression rate of 2.32 bps for the Calgary Corpus.
 

Modifications of the Burrows and Wheeler Data Compression Algorithm

Modifications of the Burrows and Wheeler Data Compression Algorithm
 

This paper of Bernhard Balkenhol, Stefan Kurtz and Yuri Shtarkov from 1999 describes a modification of the MTF algorithm, which moves the next symbol to the second place in the list instead of the first place, except the old position of the new symbol was 0 or 1, then it is moved to the first place.
This BWCA approach achieves a compression rate of 2.30 bps for the Calgary Corpus.
 

One attempt of a compression algorithm using the BWT

One attempt of a compression algorithm using the BWT
 

Another important paper of Bernhard Balkenhol and Yuri Shtarkov from 1999 with focus on improvements of the Burrows-Wheeler Compression Algorithm (BWCA).
It includes some interesting aspects for the RLE field too. An important property of the output of the BWT is the presents of many runs, which results in overestimating the probability of symbols outside the run. This problem is called "Pressure of Runs" and can be decreased by RLE. Besides that, a modification of the MTF algorithm is described which moves the next symbol to the second place in the list instead of the first place, except the old position of the new symbol was 0 or 1. If the old position of the new symbol was 1 it is moved to the first place only if the last output number was different from 0.
This paper is my favourite paper from Bernhard.
This BWCA approach achieves a compression rate of 2.26 bps for the Calgary Corpus.
 

Tree Source Identification with the Burrows Wheeler Transform

Tree Source Identification with the Burrows Wheeler Transform
 

Dror Baron and Yoram Bresler study in their article from 2000 the identification of a tree source model from a given sequence by using the BWT as a sorting transformation for states in the tree.
 

A locally adaptive data compression scheme

A locally adaptive data compression scheme
 

The original MTF paper of Jon Bentley, Daniel Sleator, Robert Tarjan and Victor Wei from 1986. The paper descibes a word based MTF scheme as a data compression scheme, that exploits locality of reference. It is based on a simple heuristic for self-organizing sequential search.
 

Distance Coding newsgroup message

Distance Coding newsgroup message
 

The famous newsgroup posting from 2000 by Edgar Binder, where he describes the DC algorithm by an example together with 3 important properties of the DC output sequence.
 

Inverting the Burrows-Wheeler Transform

Inverting the Burrows-Wheeler Transform
 

Richard Bird and Shin-Cheng Mu describe in their brief article from 2001 the inverse transformation of the BWT.
 

A Block-Sorting Lossless Data Compression Algorithm

A Block-Sorting Lossless Data Compression Algorithm
 

Michael Burrows and David Wheeler introduce the Burrows-Wheeler Transform (BWT) and the Burrows-Wheeler Compression Algorithm (BWCA) in a research report from 1994 at the Digital Systems Research Center. The paper explains the transformation and why the output sequence compresses well. Some variants of the algorithm, such as a modified MTF, are named too.
This paper is a "must" for everyone interested in BWCAs.
This BWCA approach achieves a compression rate of 2.43 bps for the Calgary Corpus.
 

On variants of block-sorting compression using context from both the left and right

On variants of block-sorting compression using context from both the left and right
 

Michael Burrows and Li Zhang describe a variation of BWT, which sorts contexts from left and right. The problem is, that the output of the new transfrom is not unambiguously, meaning that different input strings are transformed to the same output string. In order to unambiguously decode the output, one has to add log n-bits to the output.
 

BWT, a transformation algorithm

BWT, a transformation algorithm
 

A brief article by Anturo San Emeterio Campos from 1999, describing the basics of the BWT.
 

Higher Compression from the Burrows-Wheeler Transform by Modified Sorting

Higher Compression from the Burrows-Wheeler Transform by Modified Sorting
 

Brenton Chapin and Stephen Tate show in their article from 1998, that the sorting order of the alphabet has a significant impact on the compression rate of a BWCA. They present a heuristic manuel alphabet permutation and a calculated alphabet permutation. The prsented calculated alphabet permutation does not improve the compression rate in all cases. For text files their heuristic manuel alphabet permutation achieves better results.
 

Switching Between Two On-Line List Update Algorithms for Higher Compression of Burrows-Wheeler Transformed Data

Switching Between Two On-Line List Update Algorithms for Higher Compression of Burrows-Wheeler Transformed Data
 

In his article from 2000 Brenton Chapin describes a switching scheme between two list update algorithms for a BWCA, a MTF scheme and a so called "Best x of 2x - 1" algorithm.
This BWCA approach achieves a compression rate of 2.33 bps for the Calgary Corpus.
 

Higher Compression from the Burrows-Wheeler Transform with new Algorithms for the List Update Problem

Higher Compression from the Burrows-Wheeler Transform with new Algorithms for the List Update Problem
 

The doctoral thesis of Brenton Chapin describes not only different LUAs but also some preprocessing schemes like alphabet reordering and variations of the BWT (Gray code sort) in order to achieve better compression rates.
 

The Burrows-Wheeler block sorting algorithm (long)

The Burrows-Wheeler block sorting algorithm (long)
 

A short introduction into the BWT can be found at the comp.compression FAQ from 1999. The internet site has a little example and describes briefly the BWCA.
 

An analysis of the second step algorithms in the Burrows-Wheeler compression algorithm

An analysis of the second step algorithms in the Burrows-Wheeler compression algorithm
 

A  quite similar publication from 2000 to the article "Second step algorithms in the Burrows-Wheeler compression algorithm" from Sebastian Deorowicz, describing his Weighted Frequency Count Algorithm.
This BWCA approach achieves a compression rate of 2.25 bps for the Calgary Corpus.
 

Improvements to Burrows-Wheeler Compression Algorithm

Improvements to Burrows-Wheeler Compression Algorithm
 

Sebastian Deorowicz describes in his article from 2000 a BWCA, based on a modified MTF and RLE0 stage. He also uses an adapted EC stage.
This BWCA approach achieves a compression rate of 2.27 bps for the Calgary Corpus.
 

Second step algorithms in the Burrows-Wheeler compression algorithm

Second step algorithms in the Burrows-Wheeler compression algorithm
 

This publication of Sebastian Deorowicz in "Software-Practice and Experience" from 2002 give a quite complete overview of the post BWT stages used within the Burrows-Wheeler Compression Algorithm. Besides his own Weighted Frequency Count Algorithm and other post BWT stages, Sebastian describes briefly but clearly variantions of the Move To Front scheme, the Inversion Frequencies algorithm and the Distance Coding algorithm with the 3 properties of the newsgroup posting. This is one of my favourite BWCA papers.
This BWCA approach achieves a compression rate of 2.25 bps for the Calgary Corpus.
 

Universal lossless source coding with the Burrows Wheeler transform

Universal lossless source coding with the Burrows Wheeler transform
 

The work from 1999 of Michelle Effros, Karthik Visweswariah, Sanjeev Kulkarni, Sergio Verdú gives a theoretical evaluation of BWT-based coding. It prooves that BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in LZ-style codes.
 

Textkomprimierung - Multimedia-Seminar

Textkomprimierung - Multimedia-Seminar
 

Stefan Fabrewitz describes in his German seminar paper from 2000 several compression methods for text compression. He also has a brief chapter about the BWT.
 

Experiments with a Block Sorting Text Compression Algorithm

Experiments with a Block Sorting Text Compression Algorithm
 

In his article form 1995 Peter Fenwick desribes several variations of BWCAs, One variation is a modified MTF scheme, which does not allways move the current symbol to the very front of the list but a little bit behind (index greater than 0). These appraoches achieved normaly a reduction in compression performance. He also describes some sorting improvements.
This BWCA approach achieves a compression rate of 2.43 bps for the Calgary Corpus.
 

Improvements to the Block Sorting Text Compression Algorithm

Improvements to the Block Sorting Text Compression Algorithm
 

Peter Fenwick introdues 1995 some BWCA improvements and variations like sort improvements, LZ77 preprocessing, a delta model and the direct coding of the BWT output.
This BWCA approach achieves a compression rate of 2.36 bps for the Calgary Corpus.
 

Block Sorting Text Compression - Final Report

Block Sorting Text Compression - Final Report
 

This technical paper is one of Peter’s most important BWCA papers published in 1996. Peter describes the basic transformation by an example and several improvements of the basic algorithm, like hierarchical coding models, structured coding models and Wheeler’s Run Length Encoding for zeros (RLE0). He also gives some sort and MTF improvements and a interpretation of the BWT output sequence.
This is a "must", if you are interested in the BWCA field.
This BWCA approach achieves a compression rate of 2.34 bps for the Calgary Corpus.
 

Block Sorting Text Compression / Proceedings of the 19th Australian Computer Science Conference 1996

Block Sorting Text Compression / Proceedings of the 19th Australian Computer Science Conference 1996
 

This paper is similar to Peter’s "Block Sorting Text Compression - Final Report" and shows about the same several variants and improvements to BWCAs.
This BWCA approach achieves a compression rate of 2.34 bps for the Calgary Corpus.
 

Burrows Wheeler - Alternatives to Move to Front

Burrows Wheeler - Alternatives to Move to Front
 

The paper from 2003 by Peter Fenwick, Mark Titchener and Michelle Lorenz describes a cached MTF modeling scheme with a foreground (more frequent symbols) and background (less frequent symbols) model. It also introduces the so called T-entropy as a measurement for entropies after each BWCA stage.
 

Burrows-Wheeler Compression with Variable Length Integer Codes

Burrows-Wheeler Compression with Variable Length Integer Codes
 

Peter describes in his paper from 2002 an alternative to arithmetic or Huffman coding at the last stage of a BWCA by using variable length integer codes (Elias Gamma and Fraenkel-Klein Fibonacci codes). Furthermore a sticky MTF stage is revealed, which move a accessed symbol back to 40% where it come from except if the symbol is accessed twice at once.
 

An experimental study of an opportunistic index

An experimental study of an opportunistic index
 

In the article from 2001 Paolo Ferragina and Giovanni Manzini combine the BWCA with with the suffix array data structure in order to support data indexing.
 

Lossless, Reversible Transformations that Improve Text Compression Ratios

Lossless, Reversible Transformations that Improve Text Compression Ratios
 

In their paper form 2000 Robert Franceschini, Holger Kruse, Nan Zhang, Raja Iqbal and Amar Mukherjee desribe 3 text preprocessing techniques for BWCAs and PPM: LPT, RLPT and SCLPT. The achieve a compression gain of 7.1% over BZIP. The disadvantages of their method are, that they are language dependent and that they need an extern dictionary.
 

Compression, part 3: Burrows-Wheeler transform

Compression, part 3: Burrows-Wheeler transform
 

A short introduction into the BWT with some examples is presented by Ole Friis on his internet site from 2001.
 

Burrows-Wheeler Transform in image compression

Burrows-Wheeler Transform in image compression
 

Markus Gärtner and David Havelin present on their internet site from 2000 a project for compression of images by using a transformation of the image, like DCT or DWT, before using the BWCA. They also plan to adept the BWT in order to process 2-dimensional data.
 

A space-economic suffix tree construction algorithm (Powerpoint Slides)

A space-economic suffix tree construction algorithm (Powerpoint Slides)
 

This powerpoint slide from 1997 by Robert Giegerich and Stefan Kurtz explains the suffix tree construction algorithm of McCreight from 1976.
 

Efficient Implementation of Lazy Suffix Trees

Efficient Implementation of Lazy Suffix Trees
 

Robert Giegerich, Stefan Kurtz and Jens Stoye present 1999 time and space efficient suffix sorting algorithms.