www.data-compression.info The Data Compression Resource on the Internet
Entropy Coding (EC)
The process of entropy coding (EC) can be split in two parts: modeling and coding. Modeling assigns probabilities to the symbols, and coding produces a bit sequence from these probabilities. As established in Shannon's source coding theorem, there is a relationship between the symbol probability and it's corresponding bit sequence. A symbol with probabilitiy p gets a bit sequence of length -log(p). In order to achieve a good compression rate, an exact propability estimation is needed. Since the model is responsible for the probability of each symbol, modeling is one the most important tasks in data compression. Entropy coding can be done with a coding scheme, which uses a discrete number of bits for each symbol, for example Huffman coding, or with a coding scheme, which uses a discrete number of bits for several symbols. In the last case we get arithmetoc coding, if the number of symbols is equal to the total number of symbols to encode.
Although from 1988 this paper from Timothy Bell, Ian Witten and John Cleary is one of my favourites. It is easy to read, well structured and explains all important details. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling which are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
Charles Bloom presents 1996 several new techniques on high order context modeling, low order context modeling, and order-0 arithmetic coding. Emphasis is placed on economy of memory and speed. Performance is found to be significantly better than previous methods.
A well structured description of the ideas, background and implementation of arithmetic codeing in German from 2002 by Eric Bodden, Malte Clasen and Joachim Kneis. Good explanation of the renormalisation process and with complete source code. Very recommendable for German readers.
Together with the CACM87 paper this 1998 paper from Alistair Moffat, Radford Neal and Ian Witten is very well known. Improves the CACM87 implementation by using fewer multiplications and a wider range of symbol probabilities.
This ACM paper from 1987, written by Ian Witten, Radford Neal and John Cleary, is the definite front-runner of all arithmetic coding papers. The article is quite short but comes with full source code for the famous CACM87 AC implementation.
Malte Clasen is a student of the RWTH Aachen, Germany, and is known as "the update" in the demoscene, a community of people whose target is to demonstrate their coding, drawing and composing skills in small programs called demos that have no purpose except posing.
Mark is the author of the famous compression site www.datacompression.info and has published articles in the data compression field for over ten years. He is an editor of the Dr. Dobb's Journal and author of the book "The Data Compression Book". He lives in the friendly Lone Star State Texas ("All My Ex's"...).
Range coder source code from Michael Schindler, which is one of my favourite range coder implementations. A range coder is working similary to an arithmetic coder but uses less renormalisations and a faster byte output.