Silesia Corpus

www.data-compression.info
The Data Compression Resource on the Internet


Silesian University of Technology

The Silesia Corpus was developed by Sebastian Deorowicz in 2003 at the Silesian University of Technology, Poland.
The Silesia Corpus is intended to provide a set of files covering the data types in typical use today. File sizes range from 6 MB to 51 MB.
The older Calgary and Canterbury corpora have several disadvantages, as Sebastian states:
- the lack of large files—at present we work with much larger files.
- an over-representation of English-language texts—there are only English files in the three corpora, while in practice many texts are written in different languages.
- the lack of files being a concatenation of large projects (e.g., programming projects)—the application sizes grow quite fast and compressing each of the source files separately is impractical presently; a more convenient way is to concatenate the whole project and to compress the resulting file.
- absence of medical images—the medical images must not undergo a lossy compression because of law regulations.
- the lack of databases that currently grow considerably fast—databases are perhaps the fastest growing type of data.
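The point about concatenating a project before compressing it can be illustrated with a minimal sketch. It uses Python's standard `zlib` module; the generated "source files" are hypothetical stand-ins invented for this example and are not part of the corpus, so the exact byte counts carry no meaning beyond the comparison.

```python
import zlib


def compressed_size_separately(files):
    """Total size when every file is compressed as its own zlib stream."""
    return sum(len(zlib.compress(data, 9)) for data in files)


def compressed_size_concatenated(files):
    """Size when all files are joined first and compressed as one stream."""
    return len(zlib.compress(b"".join(files), 9))


# Hypothetical stand-ins for small source files that share boilerplate.
files = [
    ("#include <stdio.h>\n#include <stdlib.h>\n"
     "int func_%d(int x) { return x + %d; }\n" % (i, i)).encode("ascii")
    for i in range(50)
]

separate = compressed_size_separately(files)
joined = compressed_size_concatenated(files)
print("separately:  ", separate, "bytes")
print("concatenated:", joined, "bytes")
```

The concatenated stream is smaller here because the compressor can reuse text shared between files and pays the per-stream overhead only once. Real projects will show a different gap depending on how much the files have in common.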
On his page, Sebastian also presents some results of leading compression methods on the Silesia Corpus.

The corpus is available below and at http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia.

Publications

Title: Universal lossless data compression algorithms

Description: The dissertation of Sebastian Deorowicz concerns universal lossless data compression algorithms such as LZ, PPM, and BWCA methods. It proposes a new algorithm based on the Burrows–Wheeler transform (BWT).
Its most important features are an improved Itoh–Tanaka method for computing the BWT, a Weighted Frequency Count (WFC) transform in place of move-to-front (MTF), and weighted probability estimation. The performance of the algorithm is evaluated on the Calgary and Silesia corpora.
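For orientation, the MTF (move-to-front) stage that the Weighted Frequency Count transform replaces can be sketched as follows. This is a plain textbook MTF over a 256-symbol byte alphabet, written for illustration only. It is not code from the dissertation and does not implement the WFC transform; the function names are assumptions of this sketch.

```python
def mtf_encode(data):
    """Replace each byte by its current rank, then move it to the front."""
    alphabet = list(range(256))
    ranks = []
    for byte in data:
        rank = alphabet.index(byte)
        ranks.append(rank)
        alphabet.pop(rank)
        alphabet.insert(0, byte)
    return ranks


def mtf_decode(ranks):
    """Invert mtf_encode by replaying the same list updates."""
    alphabet = list(range(256))
    out = bytearray()
    for rank in ranks:
        byte = alphabet.pop(rank)
        out.append(byte)
        alphabet.insert(0, byte)
    return bytes(out)


sample = b"aaabbb"
encoded = mtf_encode(sample)
print(encoded)  # [97, 0, 0, 98, 0, 0]
```

Runs of equal symbols, which the BWT tends to produce, turn into runs of zeros, and a later entropy coder can exploit that skew.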

People

Name: Sebastian Deorowicz

Description: Sebastian is the author of the Weighted Frequency Count (WFC) algorithm and an Assistant Professor at the Silesian University of Technology, Poland.

Source Code

Title: The Silesia Corpus

Description: ZIP file with all files of the corpus.

Copyright © 2002-2022 Dr.-Ing. Jürgen Abel, Lechstraße 1, 41469 Neuß, Germany. All rights reserved.