Open Access Open Access  Restricted Access Subscription Access

Grammar-based Pre-processing for PPM


Affiliations
1 Department of Computer Science, University of Bangor Bangor, United Kingdom
2 Department of Computer Science, University of Tabuk, Tabuk, Saudi Arabia
 

In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly improved results in compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two pass scheme where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.

Keywords

CFG, Grammar-Based, Preprocessing, PPM, Encoding.
User
Notifications
Font Size

  • J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” Commun. IEEE Trans., vol. 32, no. 4, pp. 396–402, 1984.
  • A. Moffat, “Implementing the PPM data compression scheme,” IEEE Trans. Commun., vol. 38, no. 11, pp. 1917–1921, 1990.
  • P. Howard, “The design and analysis of efficient lossless data compression systems,” Ph.D. dissertation, Dept. Comput. Sci.,Brown Univ., Providence, RI, Jun. 1993.
  • J. Cleary and Teahan, W. “Unbounded Length Contexts for PPM,” Computing Journal, vol. 40, nos. 2 and 3, pp. 67–75, Feb. 1997.
  • C. Bloom. “Solving the problems of context modeling.” Informally published report, see http://www.cbloom.com/papers. 1998.
  • D. Shkarin. “PPM: One step to practicality”. Proc. Data Compression Conference, pp. 202-211, 2002. IEEE.
  • W. Teahan, “Modelling English text,” Ph.D. dissertation, School of Computer Science, University of Waikato, 1998.
  • Witten, I., Neal, R. & Cleary, J. “Arithmetic coding for data compression”. Communications of the ACM, vol. 30 Issue 6, June 1987.
  • W. Teahan and K. M. Alhawiti, “Design, compilation and preliminary statistics of compression corpus of written Arabic,” Technical Report, Bangor University, School of Computer Science, 2013.
  • J. Kieffer and E. Yang, “Grammar-based codes: a new class of universal lossless source codes,” Inf.
  • Theory, IEEE Trans., vol. 46, no. 3, pp. 737–754, May. 2000.
  • C. Nevill-Manning and I. Witten,“Identifying hierarchical structure in sequences: A linear-time algorithm,” J. Artif. Intell. Res.(JAIR), vol. 7, pp. 67–82, 1997.
  • N. Larsson and A. Moffat, “Off-line dictionary-based compression,” Proc. IEEE, vol. 88, pp. 1722– 1732, Nov. 2000.
  • J. Abel and W. Teahan, “Universal text pre-processing for data compression,” IEEE Transactions on Computers, 54.5: 497-507, 2005.
  • W. Francis, W. and Kucera, H. “Brown corpus manual.” Brown University. 1979.
  • S. Johansson. “The tagged LOB Corpus: User ́s Manual.” 1986.
  • W. Teahan and K. Alhawiti,“pre-processing for PPM: Compressing UTF-8 encoded natural language text,” Int. J. Comput., vol. 7, no. 2, pp. 41–51, Apr. 2015.
  • Ale Ahmad et al., “Hamshahri: A standard Persian text collection,” Knowledge-Based System, vol.
  • , no. 5, pp. 382–387, 2009.
  • A. McEnery and Z. Xiao, “The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study,” Religion, vol. 17, pp. 3–4, 2004.
  • N.C. Ellis et al., “Cronfa Electroneg o Gymraeg (CEG): a 1 million word lexical database and frequency count for Welsh,” 2001.
  • C. Nevill-Manning and I. Witten, “Compression and explanation using hierarchical grammars,” Comput. J., vol. 40, no. 2/3, pp 103–116, 1997.

Abstract Views: 394

PDF Views: 160




  • Grammar-based Pre-processing for PPM

Abstract Views: 394  |  PDF Views: 160

Authors

William J. Teahan
Department of Computer Science, University of Bangor Bangor, United Kingdom
Nojood O. Aljehane
Department of Computer Science, University of Tabuk, Tabuk, Saudi Arabia

Abstract


In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly improved results in compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two pass scheme where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.

Keywords


CFG, Grammar-Based, Preprocessing, PPM, Encoding.

References