Open Access
Subscription Access
Grammar-based Pre-processing for PPM
In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly improved results in compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two pass scheme where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
Keywords
CFG, Grammar-Based, Preprocessing, PPM, Encoding.
User
Font Size
Information
- J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” Commun. IEEE Trans., vol. 32, no. 4, pp. 396–402, 1984.
- A. Moffat, “Implementing the PPM data compression scheme,” IEEE Trans. Commun., vol. 38, no. 11, pp. 1917–1921, 1990.
- P. Howard, “The design and analysis of efficient lossless data compression systems,” Ph.D. dissertation, Dept. Comput. Sci.,Brown Univ., Providence, RI, Jun. 1993.
- J. Cleary and Teahan, W. “Unbounded Length Contexts for PPM,” Computing Journal, vol. 40, nos. 2 and 3, pp. 67–75, Feb. 1997.
- C. Bloom. “Solving the problems of context modeling.” Informally published report, see http://www.cbloom.com/papers. 1998.
- D. Shkarin. “PPM: One step to practicality”. Proc. Data Compression Conference, pp. 202-211, 2002. IEEE.
- W. Teahan, “Modelling English text,” Ph.D. dissertation, School of Computer Science, University of Waikato, 1998.
- Witten, I., Neal, R. & Cleary, J. “Arithmetic coding for data compression”. Communications of the ACM, vol. 30 Issue 6, June 1987.
- W. Teahan and K. M. Alhawiti, “Design, compilation and preliminary statistics of compression corpus of written Arabic,” Technical Report, Bangor University, School of Computer Science, 2013.
- J. Kieffer and E. Yang, “Grammar-based codes: a new class of universal lossless source codes,” Inf.
- Theory, IEEE Trans., vol. 46, no. 3, pp. 737–754, May. 2000.
- C. Nevill-Manning and I. Witten,“Identifying hierarchical structure in sequences: A linear-time algorithm,” J. Artif. Intell. Res.(JAIR), vol. 7, pp. 67–82, 1997.
- N. Larsson and A. Moffat, “Off-line dictionary-based compression,” Proc. IEEE, vol. 88, pp. 1722– 1732, Nov. 2000.
- J. Abel and W. Teahan, “Universal text pre-processing for data compression,” IEEE Transactions on Computers, 54.5: 497-507, 2005.
- W. Francis, W. and Kucera, H. “Brown corpus manual.” Brown University. 1979.
- S. Johansson. “The tagged LOB Corpus: User ́s Manual.” 1986.
- W. Teahan and K. Alhawiti,“pre-processing for PPM: Compressing UTF-8 encoded natural language text,” Int. J. Comput., vol. 7, no. 2, pp. 41–51, Apr. 2015.
- Ale Ahmad et al., “Hamshahri: A standard Persian text collection,” Knowledge-Based System, vol.
- , no. 5, pp. 382–387, 2009.
- A. McEnery and Z. Xiao, “The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study,” Religion, vol. 17, pp. 3–4, 2004.
- N.C. Ellis et al., “Cronfa Electroneg o Gymraeg (CEG): a 1 million word lexical database and frequency count for Welsh,” 2001.
- C. Nevill-Manning and I. Witten, “Compression and explanation using hierarchical grammars,” Comput. J., vol. 40, no. 2/3, pp 103–116, 1997.
Abstract Views: 392
PDF Views: 160