The Silesia Corpus was developed by Sebastian Deorowicz in 2003 at the Silesian University of Technology, Poland. Its purpose is to provide a set of files covering the data types in typical use today. The file sizes range from 6 MB to 51 MB. According to Deorowicz, the older Calgary and Canterbury corpora have several disadvantages:
- the lack of large files: at present we work with much larger files.
- an over-representation of English-language texts: there are only English files in the three corpora, while in practice many texts are written in different languages.
- the lack of files being a concatenation of large projects (e.g., programming projects): application sizes grow quite fast and compressing each of the source files separately is impractical presently; a more convenient way is to concatenate the whole project and to compress the resulting file.
- the absence of medical images: medical images must not undergo lossy compression because of legal regulations.
- the lack of databases: databases are perhaps the fastest growing type of data.
On his page, Deorowicz also presents results of leading compression methods on the Silesia Corpus.
The dissertation of Sebastian Deorowicz concerns universal lossless data compression algorithms such as the LZ, PPM, and BWCA (Burrows–Wheeler compression algorithm) methods. It proposes a new algorithm based on the Burrows–Wheeler transform (BWT). Its most important features are an improved Itoh–Tanaka method for computing the BWT, a Weighted Frequency Count transform (used instead of the move-to-front transform, MTF), and weighted probability estimation. The performance of the algorithm is evaluated on the Calgary and Silesia corpora.
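To make the two transforms named above concrete, here is a minimal Python sketch of a textbook BWT followed by a plain MTF stage. It is not the algorithm from the dissertation. The BWT is computed naively by sorting all rotations, not with the improved Itoh–Tanaka method, and the second stage is ordinary MTF, not the Weighted Frequency Count transform. The function names are illustrative assumptions.

```python
def bwt(data: bytes):
    """Naive BWT: sort all rotations, return (last column, row of the original)."""
    n = len(data)
    if n == 0:
        return b"", 0
    # Each rotation is identified by its start index; sorting is O(n^2 log n),
    # which is fine for a demonstration but far too slow for corpus-sized files.
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in rotations)
    return last, rotations.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert the BWT using a stable sort of the last column."""
    n = len(last)
    # order[j] is the position in `last` of the j-th byte of the first column.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return bytes(out)


def mtf_encode(data: bytes):
    """Plain move-to-front: emit each byte's current rank, then move it to the front."""
    alphabet = list(range(256))
    ranks = []
    for byte in data:
        rank = alphabet.index(byte)
        ranks.append(rank)
        alphabet.pop(rank)
        alphabet.insert(0, byte)
    return ranks


last, primary = bwt(b"banana")
print(last, primary)              # b'nnbaaa' 3
print(mtf_encode(last))           # [110, 0, 99, 99, 0, 0]
print(inverse_bwt(last, primary)) # b'banana'
```

The BWT groups equal bytes into runs, so the MTF output contains many zeros and small ranks, which a subsequent entropy coder can encode compactly. A BWT-based compressor of the kind described above replaces each of these simple stages with a more refined one.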