


The process of entropy coding (EC) can be split in two parts: modeling and coding. Modeling assigns probabilities to the symbols, and coding produces a bit sequence from these probabilities. As established in Shannon's source coding theorem, there is a relationship between a symbol’s probability and its corresponding bit sequence. A symbol with probability p gets a bit sequence of length log(p). In order to achieve a good compression rate, an exact propability estimation is needed. Since the model is responsible for the probability of each symbol, modeling is one the most important tasks in data compression. Entropy coding can be achieved by different coding schemes. A common scheme, which uses a discrete number of bits for each symbol, is Huffman coding. A different approach is arithmetic coding, which outputs a bit sequence representing a point inside an interval. The interval is build recursively by the probabilities of the encoded symbols.
Logo

Title

Description


Modeling for Text Compression

Although from 1988 this paper from Timothy Bell, Ian Witten and John Cleary is one of my favourites. It is easy to read, well structured and explains all important details. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling which are suitable for use in practical text compression systems. The strategies fall into three main classes: finitecontext modeling, in which the last few characters are used to condition the probability distribution for the next one; finitestate modeling, in which the distribution is conditioned by the current state (and which subsumes finitecontext modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.


Compression: Algorithms: Statistical Coders

A good introduction into entropy coding is article from Charles Bloom in 1996. The process of statistical coding is explained with many simple examples.


Solving the Problems of Context Modeling

This paper from Charles Bloom in 1998 is about the PPMZ algorithm. It handles local order estimation and secondary escape estimation.


New Techniques in Context Modeling and Arithmetic Encoding

Charles Bloom presents 1996 several new techniques on high order context modeling, low order context modeling, and order0 arithmetic coding. Emphasis is placed on economy of memory and speed. Performance is found to be significantly better than previous methods.


Arithmetische Kodierung (Proseminar Datenkompression)

A well structured description of the ideas, background and implementation of arithmetic codeing in German from 2002 by Eric Bodden, Malte Clasen and Joachim Kneis. Good explanation of the renormalisation process and with complete source code. Very recommendable for German readers.


Is Huffman Coding Dead?

A paper from 1993 written by Abraham Bookstein and Shmuel Klein about the advantages of Huffman codes against arithmetic coding, especially the speed and robustness against errors.


Arithmetic Coding by Campos

A short description about arithmetic coding from 1999 written by Arturo Campos with a little example.


Canonical Huffman by Campos

Arturo Campos describes Canonical Huffman Coding in his article from 1999 with some examples.


Inductive Modeling for Data Compression

John Cleary and Ian Witten wrote this basic paper about modeling, parsing, prediction, context and state in 1987.


Arithmetic Coding by the Data Compression Reference Center

A brief description of arithmetic coding from 2000. Easy to read, with figures and examples.


Context Modelling for Text Compression

Several modeling strategies and algorithms are presented in 1992 by the paper of Daniel Hirschberg and Debra Lelewer. It contains a very interesting blending strategy.


The Design and Analysis of Efficient Lossless Data Compression Systems

The thesis of Paul Howard from 1993 about data compression algorithms with emphasis on arithmetic coding, text and image compression.


Arithmetic Coding for Data Compression

Paul Howard and Jeffrey Vitter describe an efficient implementation which uses table lookups in the article from 1994.


Analysis of Arithmetic Coding for Data Compression

In their article from 1992 Paul Howard and Jeffrey Vitter analyse arithmetic coding and entroduce the concept of weighted entropy.


Practical Implementations of Arithmetic Coding

A tutorial on arithmetic coding from 1992 by Paul Howard and Jeffrey Vitter with table lookups for higher speed.


Data Compression (Tutorial)

A basic paper from Debra Lelewer and Daniel Hirschberg about fundametal concepts of data compression, intended as a tutorial from 1987. Contains many small examples.


Streamlining Context Models for Data Compression

This paper from 1991 was written by Debra Lelewer and Daniel Hirschberg and is about context modeling using self organizing lists to speed up the compression process.


Lossless Compression Algorithms (Entropy Encoding)

Several nice and short articles written by Dave Marshall from 2001 about entropy coding with many examples.


Range encoding: an algorithm for removing redundancy from a digitised message

Range encoding was first proposed by this paper from G. Martin in 1979, which describes the algorithm not very clearly.


Lossless Compression for Text and Images

Again a basic paper about modeling and coding with models for text and image compression, written by Alistair Moffat, Timothy Bell and Ian Witten in 1995.


Arithmetic Coding Revisited

Together with the CACM87 paper this 1998 paper from Alistair Moffat, Radford Neal and Ian Witten is very well known. Improves the CACM87 implementation by using fewer multiplications and a wider range of symbol probabilities.


Arithmetic Coding + Statistical Modeling = Data Compression

Mark Nelson's article about arithmetic coding from 1991. The concepts are easy to understand and accompanied by a simple "BILL GATES" example. Source code for Billyboy is available.


Arithmetic Coding for Data Compression

This ACM paper from 1987, written by Ian Witten, Radford Neal and John Cleary, is the definite frontrunner of all arithmetic coding papers. The article is quite short but comes with full source code for the famous CACM87 AC implementation.

Logo

Name

Description


Timothy Bell

Timothy Bell works at the University of Canterbury, New Zealand, and is "father" of the Canterbury Corpus. His research interests include compression, computer science for children, and music.


Charles Bloom

Charles Bloom has published many papers about data compression and is author of PPMZ2, a very strong compression algorithm (2.141 bps on the Calgary Corpus)


Eric Bodden

Eric Bodden is a student of the RWTH Aachen, Germany, and currently studying at the University of Kent at Canterbury. He started a small online business called Communic Arts in November 1999.


Abraham Bookstein

Abraham Bookstein works at the University of Chicago, United States of America, and has published several compression papers together with Shmuel Klein.


Arturo Campos

Arturo Campos is a student and programmer, interested in data compression, and has written several articles about data compression.


Malte Clasen

Malte Clasen is a student of the RWTH Aachen, Germany, and is known as "the update" in the demoscene, a community of people whose target is to demonstrate their coding, drawing and composing skills in small programs called demos that have no purpose except posing.


John Cleary

John Cleary works at the University of Waikato, New Zealand, and has published several well known papers together with Ian Witten and Timothy Bell.


Daniel Hirschberg

Daniel Hirschberg is working at the University of California, United States of America. He is interested in the theory of design and analysis of algorithms.


Paul Howard

Paul Howard is working at the Eastern Michigan University, United States of America, and is engaged in the arithmetic coding filed since 10 years.


Shmuel Klein

Shmuel Tomi Klein is working at the BarIlan University, Israel, and has published several compression papers together with Abraham Bookstein.


Joachim Kneis

Joachim Kneis studies Computer Science at the RWTH Aachen, Germany, and like to play "Unreal Tournament".


Mikael Lundqvist

Mikael is interested in data compression, experimental electronic music and has written a BWT implementation, an improved range coder, a faster sort algorithm and a modified MTF scheme.


Dave Marshall

Dave Marshall works at the Cardiff University, United Kingdom. He is interested in music and has several compression articles on his multimedia internet site.


G. Martin

G. Martin is the author of the first range coder paper presented on the Data Recording Conference in 1979.


Alistair Moffat

Alistair Moffat is working at the University of Melbourne, Australia. Together with Ian Witten and Timothy Bell he is author of the book "Managing Gigabytes".


Radford Neal

Radford Neal works at the University of Toronto, Canada. He is one of the authors of the CACM87 implementation, which sets the standard in aritmetic coding.


Mark Nelson

Mark is the author of the famous compression site www.datacompression.info and has published articles in the data compression field for over ten years. He is an editor of the Dr. Dobb's Journal and author of the book "The Data Compression Book". He lives in the friendly Lone Star State Texas ("All My Ex's"...).


Michael Schindler

Michael Schindler is an independent compression consultant in Austria and the author of szip and a range coder.


Jeffrey Vitter

Jeffrey Vitter works at the Purdue University, United States of America. He published several data compression papers, some of them together with Paul Howard.


Ian Witten

Ian is working at the University of Waikato, New Zealand. Together with John Cleary and Timothy Bell he published "Modeling for Text Compression".




