
Comparing State-of-the-art Models for Language Detection Methods on Short Texts from Twitter



Authors

Devendra Kumar Tayal
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Yashima Hooda
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Diksha
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India
Aananya Nagpal
Department of Computer Science & Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India

Abstract


Short-text communication on microblogging platforms such as Twitter has become the norm in today’s fast-paced world. Because these platforms have a global reach, posts are written in many languages, including region-specific ones. Language detection is therefore a prerequisite for many NLP tasks, because text can be analysed further only once its natural language has been identified. In this work, we analyse and compare the performance of two major state-of-the-art models, Naive Bayes and Logistic Regression, at identifying the natural language of short texts. Both models were trained on a dataset from Kaggle, which enables them to detect 22 languages. They were compared on accuracy, precision, recall, and F1 score, and Logistic Regression was found to perform better on a relatively small dataset such as ours.

Keywords


Natural Language Processing, Language Detection, Logistic Regression, Naive Bayes.
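
The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state how features were extracted, so character bigrams with add-one smoothing are an assumption. The names `char_ngrams`, `NaiveBayesLanguageDetector`, and `macro_scores`, and the six toy sentences, are hypothetical. Only the Naive Bayes side is shown; a Logistic Regression model trained on the same features could be scored with the same `macro_scores` helper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Multinomial Naive Bayes over character n-grams (assumed feature choice)."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter(labels)         # language -> number of texts
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Add-one (Laplace) smoothing keeps unseen n-grams from zeroing the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
    }


# Hypothetical toy data; the paper uses a Kaggle dataset covering 22 languages.
train_texts = [
    "the weather is nice today",
    "where are you going this evening",
    "el tiempo está muy bueno hoy",
    "dónde vas a ir esta noche",
    "das wetter ist heute sehr schön",
    "wohin gehst du heute abend",
]
train_labels = ["English", "English", "Spanish", "Spanish", "German", "German"]

detector = NaiveBayesLanguageDetector().fit(train_texts, train_labels)
predictions = [detector.predict(text) for text in train_texts]
print(macro_scores(train_labels, predictions))
```

Scoring on the training sentences, as the last lines do, only checks that the pieces fit together. A meaningful comparison between models would need a held-out test split.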

References