Author Details

Teahan, William J.

Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text

Authors

William J. Teahan ¹, Khaled M. Alhawiti ²

Affiliations
1 School of Computer Science, University of Wales, Bangor, GB
2 School of Computers and Information Technology, University of Tabuk, SA

Source

AIRCC's International Journal of Computer Science and Information Technology, Vol 7, No 2 (2015), Pagination: 41-51

Abstract

In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods adjust the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded using the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbol streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
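The bigraph substitution idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_subs` parameter, and the choice to represent new symbols as integers from 256 upwards are assumptions made for illustration, and the PPM coder that would consume the expanded alphabet is omitted.

```python
from collections import Counter

def bigraph_substitute(data: bytes, max_subs: int = 100):
    """Replace the most frequent two-byte pairs with new symbols >= 256.

    Hypothetical helper: returns the symbol list and the substitution table.
    """
    # Count every overlapping pair of adjacent bytes.
    counts = Counter(zip(data, data[1:]))
    # Assign one new symbol (outside the byte range) to each frequent pair.
    table = {pair: 256 + i
             for i, (pair, _) in enumerate(counts.most_common(max_subs))}
    out = []
    i = 0
    while i < len(data):
        pair = (data[i], data[i + 1]) if i + 1 < len(data) else None
        if pair in table:
            out.append(table[pair])  # emit the single substituted symbol
            i += 2
        else:
            out.append(data[i])      # emit the original byte unchanged
            i += 1
    return out, table

def bigraph_restore(symbols, table):
    """Invert bigraph_substitute, recovering the original bytes."""
    inverse = {sym: pair for pair, sym in table.items()}
    out = bytearray()
    for s in symbols:
        if s in inverse:
            out.extend(inverse[s])
        else:
            out.append(s)
    return bytes(out)

text = "مرحبا بالعالم".encode("utf-8")
symbols, table = bigraph_substitute(text, max_subs=8)
restored = bigraph_restore(symbols, table)
```

Because most Arabic letters occupy two bytes in UTF-8, frequent byte pairs often correspond to whole characters, which suggests why a substitution of this kind might shorten the symbol sequence the compressor sees.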

Keywords

Preprocessing, PPM, UTF-8, Encoding.

Grammar-based Pre-processing for PPM

Authors

William J. Teahan ¹, Nojood O. Aljehane ²

Affiliations
1 Department of Computer Science, University of Bangor, Bangor, GB
2 Department of Computer Science, University of Tabuk, Tabuk, SA

Source

AIRCC's International Journal of Computer Science and Information Technology, Vol 9, No 1 (2017), Pagination: 1-11

Abstract

In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to the compression. This leads to significantly improved results in compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two-pass scheme where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
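As a rough illustration of the two-pass scheme described in this abstract, the Python sketch below builds a small grammar from the most common bigraphs and substitutes non-terminal symbols for them, then repeats the process on the result. The names `grammar_pass`, `build_grammar` and `expand`, the `num_rules` parameter, and the use of integers above the Unicode range as non-terminals are all assumptions for illustration; trigraphs and the PPM stage are left out.

```python
from collections import Counter

def grammar_pass(seq, next_symbol, num_rules):
    """One pass: map the most common adjacent pairs to fresh non-terminals."""
    counts = Counter(zip(seq, seq[1:]))
    rules = {}
    for pair, _ in counts.most_common(num_rules):
        rules[pair] = next_symbol
        next_symbol += 1
    out, i = [], 0
    while i < len(seq):
        pair = tuple(seq[i:i + 2])
        if pair in rules:
            out.append(rules[pair])  # replace the pair with its non-terminal
            i += 2
        else:
            out.append(seq[i])
            i += 1
    # Return the grammar as non-terminal -> (left, right) productions.
    return out, {nt: pair for pair, nt in rules.items()}, next_symbol

def build_grammar(text, passes=2, num_rules=50):
    """Apply grammar_pass repeatedly; later passes may pair non-terminals."""
    seq = [ord(c) for c in text]
    next_symbol = 0x110000  # first integer above the Unicode code point range
    grammar = {}
    for _ in range(passes):
        seq, rules, next_symbol = grammar_pass(seq, next_symbol, num_rules)
        grammar.update(rules)
    return seq, grammar

def expand(seq, grammar):
    """Recursively expand non-terminals to recover the original text."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in grammar:
            left, right = grammar[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(chr(s))
    return "".join(out)

sample = "the cat sat on the mat with the hat"
encoded, grammar = build_grammar(sample, passes=2, num_rules=5)
decoded = expand(encoded, grammar)
```

In the second pass a production can refer to non-terminals created in the first, so the grammar becomes hierarchical and `expand` has to unfold symbols recursively.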

Keywords

CFG, Grammar-Based, Preprocessing, PPM, Encoding.

Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM

Authors

Mohammed Altamimi ¹, William J. Teahan ²

Affiliations
1 Department of Computer Sciences and Engineering, University of Hail, SA
2 Department of Computer Science, University of Bangor, Bangor, GB

Source

AIRCC's International Journal of Computer Science and Information Technology, Vol 9, No 2 (2017), Pagination: 131-140

Abstract

In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order 11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
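Compression-based categorisation of this kind generally assigns a text to the class whose model encodes it in the fewest bits. The Python sketch below shows that decision rule only. It does not implement PPMD: a fixed-order character model with add-one smoothing stands in for the compressor, and the class `CharModel`, the function `classify`, the labels, and the toy training strings are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class CharModel:
    """Fixed-order character model; a simplified stand-in for a PPM model."""

    def __init__(self, order=2):
        self.order = order
        self.contexts = defaultdict(Counter)
        self.alphabet = set()

    def train(self, text):
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            ctx, ch = padded[i - self.order:i], padded[i]
            self.contexts[ctx][ch] += 1
            self.alphabet.add(ch)

    def code_length(self, text):
        """Estimated bits to encode text, using add-one smoothing."""
        padded = " " * self.order + text
        size = len(self.alphabet) + 1  # one extra slot for unseen characters
        bits = 0.0
        for i in range(self.order, len(padded)):
            ctx, ch = padded[i - self.order:i], padded[i]
            counts = self.contexts.get(ctx, Counter())
            p = (counts[ch] + 1) / (sum(counts.values()) + size)
            bits -= math.log2(p)
        return bits

def classify(text, models):
    """Pick the label whose model gives the shortest code length."""
    return min(models, key=lambda label: models[label].code_length(text))

# Toy training data, purely illustrative.
models = {"class_a": CharModel(), "class_b": CharModel()}
models["class_a"].train("aaaa bbbb aaaa bbbb")
models["class_b"].train("xyz xyz xyz xyz")
label_1 = classify("aaaa", models)
label_2 = classify("xyz xyz", models)
```

A real PPMD model blends several context orders through escape probabilities rather than relying on one fixed order, so this sketch should be read as showing the minimum code length rule and not the accuracy figures reported in the abstract.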

Keywords

Arabic Text Categorisation, Data Compression, Machine Learning Algorithms.
