The Protein Corpus is a set of 4 files that were used in the article "Protein is incompressible" by Craig Nevill-Manning and Ian Witten, presented at DCC 1999. The files are HI (Haemophilus influenzae), HS (Homo sapiens), MJ (Methanococcus jannaschii) and SC (Saccharomyces cerevisiae). Protein is difficult to compress because protein sequences show little Markov dependency. Craig and Ian developed a special compression program called CP (Compress Protein), which achieves an average compression rate of 4.113 bps for the 4 files of the corpus. Because strong compression is hard to obtain for these files, the Protein Corpus makes a good file set for the evaluation of compression algorithms.
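To put the 4.113 bps figure in context, the following minimal sketch computes the empirical order-0 entropy of a sequence and compares it with log2(20), the cost of coding 20 equally likely amino acids. It reads "bps" as bits per symbol and assumes one byte per amino acid, as in the corpus files. The function name and the toy sequence are illustrative only and do not come from the paper or from CP.

```python
import math
from collections import Counter


def order0_entropy_bps(data: bytes) -> float:
    """Empirical order-0 entropy of `data` in bits per symbol."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    # Shannon entropy of the observed symbol frequencies.
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Cost of a uniform code over the 20 standard amino acids (about 4.32 bits).
UNIFORM_20_BPS = math.log2(20)

# Hypothetical toy sequence, not taken from the corpus.
sample = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

print(f"order-0 entropy: {order0_entropy_bps(sample):.3f} bps")
print(f"uniform 20-letter code: {UNIFORM_20_BPS:.3f} bps")
```

A compressor that exploits no context at all cannot do better than the order-0 entropy, so the gap between that value and a reported rate such as 4.113 bps gives a rough idea of how much context modelling actually contributes.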
The corpus is available below. Please note that Craig sent me a note saying that the files posted here previously were not the same as those used in his paper. The files below, dated 12.06.2004, are now the correct ones.
An excerpt of the article "Protein is incompressible" describing the 4 files follows:
We used four genomes in our analysis. The first, Haemophilus influenzae (HI), is a bacterium that causes ear and respiratory infections in children. This genome was the first to be fully sequenced, and was made available in 1996. It is 1.83 megabases in size and contains approximately 1740 potential genes. When these genes are translated to proteins, the resulting file of amino acid sequences is 500Kb (representing each amino acid as one byte). The second genome, Saccharomyces cerevisiae (SC), or baker’s yeast, has been studied as a model organism for several decades. At 13 megabases, it is the largest organism sequenced to date. The file of 8,220 protein sequences from S. cerevisiae is 2.9 Mb in size. The third genome, Methanococcus jannaschii (MJ), lives in very hot undersea vents and has a unique metabolism. It is 1.7 megabases in size, and has 1680 genes for a protein file size of 450 Kb. The final genome, Homo sapiens (HS), is incomplete: it includes 5733 human genes, for a file size of 3.3Mb.
The 4 protein files were introduced in the paper by Craig Nevill-Manning and Ian Witten at DCC 1999. The authors state that protein files are not very compressible with PPM-like algorithms and introduce a special protein compression program called CP. Further investigation will have to show how much potential BWCAs have in the field of protein compression.
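As a starting point for such an investigation, the sketch below measures the rate in bits per symbol that two general-purpose compressors from the Python standard library reach on a byte sequence: bz2, which is based on the Burrows-Wheeler transform, and zlib, which is based on LZ77. Neither is the CP program nor a tuned BWCA, and the helper names and the synthetic test data are assumptions made for illustration. To evaluate the corpus itself, the bytes of a downloaded corpus file would be passed in instead.

```python
import bz2
import random
import zlib


def compressed_bps(data: bytes, compress) -> float:
    """Bits of compressed output per input symbol (one byte per symbol)."""
    if not data:
        raise ValueError("cannot compute a rate for empty input")
    return 8 * len(compress(data)) / len(data)


def report(data: bytes) -> dict:
    """Rates for a BWT-based and an LZ77-based general-purpose compressor."""
    return {
        "bz2 (BWT-based)": compressed_bps(data, lambda d: bz2.compress(d, 9)),
        "zlib (LZ77-based)": compressed_bps(data, lambda d: zlib.compress(d, 9)),
    }


# Synthetic stand-in: 50,000 symbols drawn uniformly from 20 letters.
# Real protein data is not uniform; this only exercises the measurement.
rng = random.Random(0)
synthetic = "".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=50_000)).encode("ascii")

for name, bps in report(synthetic).items():
    print(f"{name}: {bps:.3f} bps")
```

Rates measured this way include container overhead such as headers and checksums, which is negligible for files of several hundred kilobytes but noticeable for very short inputs.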
Craig works at the Department of Computer Science of Rutgers, the State University of New Jersey, United States of America. He published the article "Protein is incompressible" together with Ian Witten. His research interests include information retrieval, bioinformatics, inferring sequential structure, machine learning and data compression.