Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text

William J. Teahan ¹, Khaled M. Alhawiti ²

Affiliations
1 School of Computer Science, University of Wales, Bangor, United Kingdom
2 School of Computers and Information Technology, University of Tabuk, Saudi Arabia

Abstract

This paper describes several new universal preprocessing techniques that improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. Each method adjusts the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. First, a simple bigraph (two-byte) substitution technique is described that significantly improves compression for many languages when they are encoded using the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Second, a new preprocessing technique is investigated that outputs separate vocabulary and symbol streams, which are subsequently encoded separately. This also leads to a significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless compression of Arabic text are described for dotted and non-dotted forms of the language.
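
To make the bigraph substitution idea concrete, the following is a minimal Python sketch. It is not the authors' implementation. The abstract does not say how bigraphs are selected or coded, so the sketch assumes that the most frequent overlapping byte pairs are chosen and that each is mapped to a new symbol number of 256 or above, which expands the alphabet seen by the compressor. The function names `top_bigraphs`, `substitute` and `restore` are hypothetical.

```python
from collections import Counter


def top_bigraphs(data: bytes, n: int = 100) -> list[bytes]:
    """Return the n most frequent two-byte sequences (assumed selection rule)."""
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return [pair for pair, _ in counts.most_common(n)]


def substitute(data: bytes, bigraphs: list[bytes]) -> list[int]:
    """Replace each listed bigraph with one symbol >= 256, scanning left to right."""
    code = {pair: 256 + k for k, pair in enumerate(bigraphs)}
    out = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if len(pair) == 2 and pair in code:
            out.append(code[pair])  # one symbol now stands for two bytes
            i += 2
        else:
            out.append(data[i])     # ordinary byte value, 0..255
            i += 1
    return out


def restore(symbols: list[int], bigraphs: list[bytes]) -> bytes:
    """Invert substitute(): expand every symbol >= 256 back to its two bytes."""
    out = bytearray()
    for s in symbols:
        if s >= 256:
            out += bigraphs[s - 256]
        else:
            out.append(s)
    return bytes(out)


if __name__ == "__main__":
    text = "مرحبا بالعالم".encode("utf-8")
    table = top_bigraphs(text, 10)
    symbols = substitute(text, table)
    print(len(text), "bytes ->", len(symbols), "symbols")
```

In this sketch the resulting symbol sequence, rather than the raw bytes, would be passed to a PPM coder that accepts an alphabet larger than 256. The bigraph table must also be available to the decoder so that `restore` can reverse the substitution.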