
A New Approach to Parts of Speech Tagging in Malayalam


Affiliations
1 Department of Computer Science, University of Kerala, India
2 Department of Linguistics, University of Kerala, India
 

Parts-of-speech tagging is the process of labeling each word in a sentence with a tag that indicates the word's usage in that sentence. Usually, these tags indicate a syntactic category such as noun or verb, and sometimes they include additional information such as case markers (number, gender, etc.) and tense markers. Many current language processing systems use a parts-of-speech tagger for pre-processing.

Two main approaches are followed in parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules. It is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It relies on a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower time and space complexity. The stochastic approach is the more widely used one nowadays because of its accuracy.
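The contrast between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the English toy words, the `LEXICON` and `SUFFIX_RULES` tables, the tag names, and the function names are hypothetical and are not taken from the paper. The stochastic side is reduced to the simplest possible model, the most frequent tag per word in a tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon and handwritten suffix rules (illustration only).
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}
SUFFIX_RULES = [("ing", "VERB"), ("ly", "ADV"), ("s", "NOUN")]


def rule_based_tag(words):
    """Tag by lexicon lookup, falling back to the first matching suffix rule."""
    tags = []
    for word in words:
        tag = LEXICON.get(word)
        if tag is None:
            tag = next(
                (t for suffix, t in SUFFIX_RULES if word.endswith(suffix)),
                "UNK",
            )
        tags.append(tag)
    return tags


def train_unigram(tagged_sentences):
    """Learn the most frequent tag for each word from a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def stochastic_tag(words, model, default="UNK"):
    """Assign each word its most frequent tag seen in training."""
    return [model.get(word, default) for word in words]


# Hypothetical tagged corpus: "run" appears once as NOUN and twice as VERB.
corpus = [
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
]
model = train_unigram(corpus)
```

In this sketch the rule-based tagger needs only its two small tables, while the statistical tagger must first read and count the whole corpus, which is where the higher time and space cost comes from.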

Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to the root word forms. The algorithms currently in use are efficient machine learning algorithms, but they were not built for Malayalam, which reduces the accuracy of Malayalam POS tagging.

The proposed approach uses dictionary entries along with adjacent tag information. The algorithm is multithreaded. Tagging is done using the probability of occurrence of the sentence structure together with the dictionary entry.
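One way to read this description is sketched below. It is not the authors' algorithm: the abstract gives no implementation details, so the greedy left-to-right choice, the `DICTIONARY` and `TRANSITIONS` tables, the probability values, the `FLOOR` constant, and the English toy words are all assumptions made for illustration. Each word's candidate tags come from the dictionary, the candidate most probable after the previous tag is chosen, and sentences are distributed over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dictionary: word -> candidate tags (toy data, not from the paper).
DICTIONARY = {
    "the": ["DET"],
    "old": ["ADJ"],
    "can": ["AUX", "NOUN", "VERB"],
    "rusts": ["VERB", "NOUN"],
}

# Hypothetical adjacent-tag probabilities P(tag | previous tag).
TRANSITIONS = {
    ("<s>", "DET"): 0.6,
    ("DET", "NOUN"): 0.7,
    ("DET", "ADJ"): 0.25,
    ("DET", "AUX"): 0.01,
    ("DET", "VERB"): 0.01,
    ("ADJ", "NOUN"): 0.8,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
}
FLOOR = 1e-4  # assumed score for tag pairs missing from the table


def tag_sentence(words):
    """Greedily pick, for each word, the dictionary tag most likely after the previous tag."""
    previous = "<s>"
    tags = []
    for word in words:
        candidates = DICTIONARY.get(word.lower(), ["UNK"])
        best = max(candidates, key=lambda t: TRANSITIONS.get((previous, t), FLOOR))
        tags.append(best)
        previous = best
    return tags


def tag_corpus(sentences, workers=4):
    """Tag sentences concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tag_sentence, sentences))


tagged = tag_corpus([["the", "old", "can", "rusts"], ["the", "can"]])
```

Sentences are tagged independently here, which makes them a natural unit of work for threads. In standard CPython the global interpreter lock limits the speedup for CPU-bound work like this, so the thread pool shows the structure rather than a guaranteed gain.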


Keywords

NLP, POS Tagger, Rule Based Approach, Stochastic Approach, Multithreading, Dictionary Entry, Malayalam.

Authors

D. Muhammad Noorul Mubarak
Department of Computer Science, University of Kerala, India
Sareesh Madhu
Department of Linguistics, University of Kerala, India
S. A. Shanavas
Department of Linguistics, University of Kerala, India
