
Comparing State-of-the-art Models for Language Detection Methods on Short Texts from Twitter


Affiliations
1 Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
 

Short-text communication via microblogging platforms such as Twitter has become the norm in today’s fast-paced world. These platforms have a global reach, so the use of multiple languages (including region-specific ones) is common. Language detection is an important task that underpins many NLP applications, since data can be analysed further only once its natural language has been identified. In this work, we analyse and compare the performance of two state-of-the-art models, Naive Bayes and Logistic Regression, for identifying the natural language of short texts. Both models were trained on a Kaggle dataset, enabling them to detect 22 languages. They were compared on parameters such as accuracy, precision, recall, and F1-score, and we found that Logistic Regression performs better on relatively small datasets such as ours.

Keywords

Natural Language Processing, Language Detection, Logistic Regression, Naive Bayes.

References

  • Timothy Baldwin and Marco Lui. 2010. Language Identification: The Long and the Short of the Matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 229– 237, Los Angeles, California. Association for Computational Linguistics
  • Gamallo, Pablo & Garcia, Marcos & Sotelo, Susana & Campos, José. (2014). Comparing ranking-based and Naive Bayes approaches to language detection on tweets. CEUR Workshop Proceedings. 1228. 12-16
  • Sarthak, Shukla, S., Mittal, G. (2019). Spoken Language Identification Using ConvNets. In: Chatzigiannakis, I., De Ruyter, B., Mavrommati, I. (eds) Ambient Intelligence. AmI 2019. Lecture Notes in Computer Science, vol 11912. Springer, Cham.
  • Konthala Yasaswini, Karthik Puranik, Adeep Hande, Ruba Priyadharshini, Sajeetha Thavareesan, and Bharathi Raja Chakravarthi. 2021. IIITT@DravidianLangTech-EACL2021: Transfer Learning for Offensive Language Detection in Dravidian Languages. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 187–194, Kyiv. Association for Computational Linguistics.
  • Pidong Wang, Nikhil Bojja, and Shivasankari Kannan. 2015. A Language Detection System for Short Chats in Mobile Games. In Proceedings of the third International Workshop on Natural Language Processing for Social Media, pages 20–28, Denver, Colorado. Association for Computational Linguistics.
  • Swanson, Ben & Charniak, Eugene. (2012). Native language detection with tree substitution grammars. Vol. 2. 193-197.
  • Thoma, Martin. “The WiLI benchmark dataset for written language identification.” ArXiv abs/1801.07779 (2018): n. pag.
  • Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic Detection and Language Identification of Multilingual Documents. Transactions of the Association for Computational Linguistics, 2:27–40.
  • Jônatas Wehrmann, Willian E. Becker, and Rodrigo C. Barros. 2018. A multi-task neural network for multilingual sentiment classification and language detection on Twitter. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (SAC '18). Association for Computing Machinery, New York, NY, USA, 1805–1812. https://doi.org/10.1145/3167132.3167325
  • Wiegand, Michael & Ruppenhofer, Josef. (2021). Exploiting Emojis for Abusive Language Detection. 369-380. 10.18653/v1/2021.eacl-main.28.
  • P. Chakraborty and M. H. Seddiqui, "Threat and Abusive Language Detection on Social Media in Bengali Language," 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, 2019, pp. 1-6, doi: 10.1109/ICASERT.2019.8934609.
  • Sai Muralidhar Jayanthi, Kavya Nerella, Khyathi Raghavi Chandu, and Alan W Black. 2021. CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic CodeSwitching, pages 113–118, Online. Association for Computational Linguistics.
  • Bharathi B and Agnusimmaculate Silvia A. 2021. SSNCSE_NLP@DravidianLangTechEACL2021: Offensive Language Identification on Multilingual Code Mixing Text. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 313–318, Kyiv. Association for Computational Linguistics.
  • Yash Kumar Lal, Vaibhav Kumar, Mrinal Dhar, Manish Shrivastava, and Philipp Koehn. 2019. De-Mixing Sentiment from Code-Mixed Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 371–377, Florence, Italy. Association for Computational Linguistics.
  • Yili Ma, Liang Zhao, and Jie Hao. 2020. XLP at SemEval-2020 Task 9: Cross-lingual Models with Focal Loss for Sentiment Analysis of Code-Mixing Language. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 975–980, Barcelona (online). International Committee for Computational Linguistics.
  • Li, Shuyue Stella, and Kenton Murray. “Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns.” ArXiv abs/2211.07628 (2022): n. pag.
  • Kodali, Prashant & Sachan, Tanmay & Goindani, Akshay & Goel, Anmol & Ahuja, Naman & Shrivastava, Manish & Kumaraguru, Ponnurangam. (2022). PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality. 10.48550/arXiv.2206.07988.
  • Prashant Kodali, Anmol Goel, Monojit Choudhury, Manish Shrivastava, and Ponnurangam Kumaraguru. 2022. SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing. In Findings of the Association for Computational Linguistics: ACL 2022, pages 472–480, Dublin, Ireland. Association for Computational Linguistics.
  • SaiKrishna Rallabandi, Sunayana Sitaram, and Alan W Black. 2018. Automatic Detection of Code-switching Style from Acoustics. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 76–81, Melbourne, Australia. Association for Computational Linguistics.
  • Aditya Shah and Chandresh Maurya. 2021. How effective is incongruity? Implications for code-mixed sarcasm detection. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 271–276, National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).
  • Barman, Utsab & Das, Amitava & Wagner, Joachim & Foster, Jennifer. (2014). Code Mixing: A Challenge for Language Identification in the Language of Social Media. 10.13140/2.1.3385.6967.
  • Winata, Genta & Cahyawijaya, Samuel & Liu, Zihan & Lin, Zhaojiang & Madotto, Andrea & Fung, Pascale. (2021). Are Multilingual Models Effective in Code-Switching?. 142-153. 10.18653/v1/2021.calcs-1.20.
  • Chan, Joyce. (2006). Automatic speech recognition of Cantonese-English code-mixing utterances.
  • Jinhua Du, Yan Huang, and Karo Moilanen. 2019. AIG Investments.AI at the FinSBD Task: Sentence Boundary Detection through Sequence Labelling and BERT Fine-tuning. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing (FinNLP 2019), Macao, China.
  • J. Y. C. Chan, P. C. Ching, Tan Lee and H. M. Meng, "Detection of language boundary in code-switching utterances by bi-phone probabilities," 2004 International Symposium on Chinese Spoken Language Processing, Hong Kong, China, 2004, pp. 293-296, doi: 10.1109/CHINSL.2004.1409644.
  • Balazevic, Ivana, et al. “Language Detection For Short Text Messages In Social Media.” ArXiv abs/1608.08515 (2016): n. pag.
  • Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. 2017. Estimating Code-Switching on Twitter with a Novel Generalized WordLevel Language Detection Technique. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1971–1982, Vancouver, Canada. Association for Computational Linguistics.
  • Suman Dowlagar and Radhika Mamidi. 2021. A Pre-trained Transformer and CNN Model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed SocialMedia Text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 367–374, Held Online. INCOMA Ltd.
  • Sebastin Santy, Anirudh Srinivasan, and Monojit Choudhury. 2021. BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT? In AdaptNLP, EACL 2021.
  • https://www.kaggle.com/code/martinkk5575/language-detection/data (Kaggle dataset)
  • https://www.kaggle.com/code/rtatman/analyzing-multilingual-data/data (Tweet data)
  • https://www.kaggle.com/code/ayushmi77al/language-detection-nlp/notebook
  • https://thecleverprogrammer.com/2021/10/30/language-detection-with-machine-learning/

Authors

Devendra Kumar Tayal
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Yashima Hooda
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Diksha
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Aananya Nagpal
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
