Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM


Affiliations
1 Department of Computer Sciences And Engineering, University of Hail, Saudi Arabia
2 Department of Computer Science, University of Bangor, Bangor, United Kingdom
 

In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order 11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
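
A common way to use a compression model for categorisation is to train one model per class and assign a test text to the class whose model encodes it in the fewest bits per character. The Python sketch below illustrates that idea only; it is not the authors' implementation. The names `SimplePPMD` and `classify` are hypothetical, the model is a simplified static variant (PPMD-style escape estimates, but no exclusions and no model updates while coding), the alphabet size is an assumed constant, and the training strings are made-up toy data rather than tweets. The order is set to 2 to suit the toy data; the abstract reports order 11 working best on the real corpus.

```python
import math
from collections import defaultdict


class SimplePPMD:
    """Static character model with PPMD-style escapes (simplified sketch)."""

    def __init__(self, order=2, alphabet_size=65536):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the order -1 model
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, symbol in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][symbol] += 1

    def symbol_bits(self, history, symbol):
        """Cost in bits of `symbol` given the preceding characters."""
        bits = 0.0
        for k in range(min(self.order, len(history)), -1, -1):
            context = history[len(history) - k:]
            seen = self.counts.get(context)
            if not seen:
                continue  # context never seen in training: back off at no cost
            total = sum(seen.values())
            count = seen.get(symbol, 0)
            if count:
                # Method D symbol estimate: (2c - 1) / (2n)
                return bits - math.log2((2 * count - 1) / (2 * total))
            # Method D escape estimate: t / (2n), t = distinct symbols seen
            bits -= math.log2(len(seen) / (2 * total))
        # Order -1 fallback: uniform over the assumed alphabet
        return bits + math.log2(self.alphabet_size)

    def cross_entropy(self, text):
        """Average bits per character needed to encode `text`."""
        total = 0.0
        for i, symbol in enumerate(text):
            total += self.symbol_bits(text[max(0, i - self.order):i], symbol)
        return total / max(len(text), 1)


def classify(text, models):
    """Return the label whose model encodes `text` most cheaply."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy, made-up training data: one model per class.
models = {"a": SimplePPMD(order=2), "b": SimplePPMD(order=2)}
models["a"].train("abababababab")
models["b"].train("xyzxyzxyzxyz")

print(classify("ababab", models))
```

A full PPMD coder would also apply exclusions and update its counts as it encodes, so the absolute bit counts from this sketch are only indicative; the ranking of classes by code length is the part that carries over.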

Keywords

Arabic Text Categorisation, Data Compression, Machine Learning Algorithms.



Authors

Mohammed Altamimi
Department of Computer Sciences And Engineering, University of Hail, Saudi Arabia
William J. Teahan
Department of Computer Science, University of Bangor, Bangor, United Kingdom

