
Word Level Language Identification of English-Punjabi Code-Mixed Social Media Text


Affiliations
1 Department of Computer Science, Punjabi University College of Engineering & Management, Rampura Phul, India
2 Department of Computer Science, Punjabi University, Patiala, India
3 Department of Computer Science and Engineering, Yadavindra College of Engineering, Talwandi Sabo, India
 

Abstract

Code mixing is the use of multiple languages within a single utterance. It is pervasive in social media communication, irrespective of the mode being used. The fusion of languages makes processing such text more challenging, and systems require consistent updates to keep pace with recent trends. This paper compares three approaches to word-level language identification of code-mixed English-Punjabi text: Conditional Random Fields (CRFs), Bidirectional Long Short-Term Memory (Bi-LSTM) networks, and Convolutional Neural Networks (CNNs). First, a CRF-based system uses lexical, contextual, character n-gram, and special-character features. Second, a Bi-LSTM, which is a recurrent neural network, is used with GloVe embeddings. Third, a CNN is used with GloVe embeddings. The CRF-based system performs best, with an F1-score of 0.96.

Keywords

Code Mixing, Language Identification, Deep Learning, Glove Embedding, Conditional Random Fields.
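
The abstract names four feature families for the CRF-based system: lexical, contextual, character n-gram, and special-character features. The sketch below illustrates what per-token features of that kind could look like. It is not the authors' feature set: the function names, feature keys, n-gram range, and the sample tokens are all assumptions made for illustration. The output is one dict per token, a format that CRF toolkits such as sklearn-crfsuite accept as input.

```python
import re


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of `word` for n in [n_min, n_max]."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def token_features(tokens, i):
    """Build an illustrative feature dict for the token at position i."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Lexical features (hypothetical choices)
        "bias": 1.0,
        "lower": lower,
        "length": len(word),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Special-character features: punctuation, hashtags, mentions
        "has_special": bool(re.search(r"[^\w\s]", word)),
        "starts_with_at_or_hash": word[:1] in {"@", "#"},
        # Contextual features: neighbouring tokens, with boundary markers
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-gram features, prefixed to avoid clashing with other keys
    for gram in char_ngrams(lower):
        feats["ng=" + gram] = True
    return feats


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Made-up romanized code-mixed tokens, for illustration only
    sample = ["Kal", "movie", "dekhi", "#fun"]
    for token, feats in zip(sample, sentence_features(sample)):
        print(token, feats["prev"], feats["next"], feats["has_special"])
```

In a full pipeline, each sentence's list of feature dicts would be paired with a list of per-token language labels and passed to a CRF trainer. The label inventory and training settings are not given in the abstract, so they are left out here.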



Authors

Neetika Bansal
Department of Computer Science, Punjabi University College of Engineering & Management, Rampura Phul, India
Vishal Goyal
Department of Computer Science, Punjabi University, Patiala, India
Simpel Rani
Department of Computer Science and Engineering, Yadavindra College of Engineering, Talwandi Sabo, India

