Teahan, William J.
- Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text
Authors
Affiliations
1 School of Computer Science, University of Wales, Bangor, GB
2 School of Computers and Information Technology, University of Tabuk, SA
Source
AIRCC's International Journal of Computer Science and Information Technology, Vol 7, No 2 (2015), Pagination: 41-51
Abstract
In this paper, several new universal preprocessing techniques are described to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially adjust the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text. Firstly, a simple bigraph (two-byte) substitution technique is described that leads to significant improvement in compression for many languages when they are encoded by the Unicode scheme (25% for Arabic text, 14% for Armenian, 9% for Persian, 15% for Russian, 1% for Chinese text, and over 5% for both English and Welsh text). Secondly, a new preprocessing technique that outputs separate vocabulary and symbol streams, which are subsequently encoded separately, is also investigated. This also leads to significant improvement in compression for many languages (24% for Arabic text, 30% for Armenian, 32% for Persian and 35% for Russian). Finally, novel preprocessing and postprocessing techniques for lossy and lossless text compression of Arabic text are described for dotted and non-dotted forms of the language.
Keywords
Preprocessing, PPM, UTF-8, Encoding.
- Grammar-based Pre-processing for PPM
Authors
Affiliations
1 Department of Computer Science, University of Bangor, Bangor, GB
2 Department of Computer Science, University of Tabuk, Tabuk, SA
Source
AIRCC's International Journal of Computer Science and Information Technology, Vol 9, No 1 (2017), Pagination: 1-11
Abstract
In this paper, we apply grammar-based pre-processing prior to using the Prediction by Partial Matching (PPM) compression algorithm. This achieves significantly better compression for different natural language texts compared to other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed. In a pre-processing phase prior to compression, it then substitutes these sequences with the respective non-terminal symbols defined by the grammar. This leads to significantly improved results in compression for various natural languages (a 5% improvement for American English, 10% for British English, 29% for Welsh, 10% for Arabic, 3% for Persian and 35% for Chinese). We describe further improvements using a two-pass scheme where the grammar-based pre-processing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression, between 11% and 20%, when compared with other compression algorithms, including a grammar-based approach, the Sequitur algorithm.
Keywords
CFG, Grammar-Based, Preprocessing, PPM, Encoding.
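The bigraph substitution step summarised in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `max_rules` limit, and the use of Unicode private-use code points as non-terminal symbols are all hypothetical choices, and the input text is assumed to contain no private-use characters. The PPM compression stage that would follow is omitted.

```python
from collections import Counter


def build_bigraph_grammar(text, max_rules=100, first_symbol=0xE000):
    """Map the most frequent bigraphs to fresh symbols.

    Assumption: symbols are drawn from the Unicode private-use area
    (starting at U+E000), which the input text must not already use.
    """
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    grammar = {}
    for bigraph, _ in counts.most_common(max_rules):
        grammar[bigraph] = chr(first_symbol + len(grammar))
    return grammar


def substitute(text, grammar):
    """Replace bigraphs with their symbols, scanning greedily left to right."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in grammar:
            out.append(grammar[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def expand(encoded, grammar):
    """Invert substitute() by expanding every symbol back to its bigraph."""
    inverse = {symbol: bigraph for bigraph, symbol in grammar.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)
```

In this sketch the substituted text would be handed to the compressor together with the grammar, and `expand` would restore the original after decompression. A second pass, as mentioned in the abstract, could be approximated by calling `build_bigraph_grammar` and `substitute` again on the already substituted text with a different `first_symbol`.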
- Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Authors
Affiliations
1 Department of Computer Sciences And Engineering, University of Hail, SA
2 Department of Computer Science, University of Bangor, Bangor, GB
Source
AIRCC's International Journal of Computer Science and Information Technology, Vol 9, No 2 (2017), Pagination: 131-140
Abstract
In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order 11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
Keywords
Arabic Text Categorisation, Data Compression, Machine Learning Algorithms.
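The idea behind compression-based categorisation, as summarised in the abstract above, is to train one model per class and assign a text to the class whose model would encode it in the fewest bits. The sketch below illustrates that principle only. It swaps PPMD for a much simpler fixed-order character model with add-one smoothing, and the class `CharModel`, the function `classify`, and the default order of 2 are hypothetical names and choices, not taken from the paper.

```python
import math
from collections import Counter


class CharModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: this is a simplified stand-in for PPMD, which instead
    blends several context orders using escape probabilities.
    """

    def __init__(self, order=2):
        self.order = order
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        self.alphabet = set()

    def train(self, text):
        """Count every character together with its preceding context."""
        padded = " " * self.order + text
        self.alphabet.update(text)
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def code_length(self, text):
        """Approximate number of bits needed to encode text with this model."""
        padded = " " * self.order + text
        size = len(self.alphabet) + 1  # one extra slot for unseen characters
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]]
            total = self.context_counts[context]
            bits -= math.log2((seen + 1) / (total + size))
        return bits


def classify(text, models):
    """Return the label whose model gives the shortest code length."""
    return min(models, key=lambda label: models[label].code_length(text))
```

Here `models` would be a dictionary from class label (for example an author or a gender) to a `CharModel` trained on that class's texts. A real PPMD model of order 11, as reported in the abstract, would replace `CharModel` while `classify` stays the same.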