
Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text


Affiliations
1 School of Computer Science, University of Wales, Bangor, United Kingdom
2 School of Computers and Information Technology, University of Tabuk, Saudi Arabia
 

Abstract

This paper describes several new universal preprocessing techniques that improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods adjust the alphabet (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. First, a simple bigraph (two-byte) substitution technique is described that significantly improves compression for many languages when they are encoded using Unicode (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Second, a new preprocessing technique is investigated that outputs separate vocabulary and symbol streams, which are then encoded separately. This also significantly improves compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless compression of Arabic text are described for dotted and non-dotted forms of the language.
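As a rough illustration of the alphabet-expansion idea behind bigraph substitution, the following Python sketch replaces frequent two-byte sequences with single symbols numbered from 256 upward. It is not the authors' implementation: the function names, the `max_entries` parameter, the frequency-based choice of bigraphs and the greedy left-to-right matching are all assumptions made for illustration, and the PPM coder that would consume the resulting symbol sequence is not shown.

```python
from collections import Counter


def build_bigraph_table(data: bytes, max_entries: int = 100) -> dict:
    """Map the most frequent two-byte sequences to new symbol numbers >= 256.

    Assumption: bigraphs are chosen purely by frequency in the input itself.
    """
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return {
        pair: 256 + rank
        for rank, (pair, _) in enumerate(counts.most_common(max_entries))
    }


def substitute_bigraphs(data: bytes, table: dict) -> list:
    """Greedy left-to-right pass over the bytes.

    Emits one expanded-alphabet symbol per matched bigraph,
    otherwise the raw byte value (0..255).
    """
    symbols = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if len(pair) == 2 and pair in table:
            symbols.append(table[pair])
            i += 2
        else:
            symbols.append(data[i])
            i += 1
    return symbols


def restore_bytes(symbols: list, table: dict) -> bytes:
    """Invert substitute_bigraphs so the preprocessing stays lossless."""
    inverse = {code: pair for pair, code in table.items()}
    out = bytearray()
    for s in symbols:
        if s >= 256:
            out += inverse[s]
        else:
            out.append(s)
    return bytes(out)


if __name__ == "__main__":
    # Cyrillic letters occupy two bytes each in UTF-8.
    text = "привет, привет".encode("utf-8")
    table = build_bigraph_table(text)
    symbols = substitute_bigraphs(text, table)
    print(len(text), "bytes ->", len(symbols), "symbols")
```

In scripts such as Arabic, Armenian or Cyrillic, most characters are two bytes long in UTF-8, so a substitution of this kind hands the compressor a shorter sequence drawn from a larger alphabet. The table would also have to be available to the decoder, which this sketch does not address.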

Keywords

Preprocessing, PPM, UTF-8, Encoding.

Authors

William J. Teahan
School of Computer Science, University of Wales, Bangor, United Kingdom
Khaled M. Alhawiti
School of Computers and Information Technology, University of Tabuk, Saudi Arabia
