
Automatic Tamil Document Categorization Based on the Naive Bayes Algorithm


Affiliations
1 Anna University, Chennai, India
     



Abstract

This paper addresses the automatic classification of Tamil documents. Documents are repositories of knowledge, but they are so numerous that searching them effectively is time consuming. Document categorization simplifies search and supports other applications such as event detection and tracking, document clustering, and grouping. It is a challenging task and has recently become an active research topic in information retrieval. The objective of document categorization is to assign to a document one or more entries from a set of prespecified categories. Traditionally, this task is performed manually by domain experts: each incoming document is read and comprehended by an expert, who then assigns it to a number of categories chosen from the prespecified set. This inevitably requires a large amount of manual effort. A promising alternative is to learn a categorization scheme automatically from training examples. In the training phase, a set of documents with class labels attached is given, and a classification system is built using a learning method. Once the categorization scheme is learned, it can be used to classify future documents. Various techniques can be used to determine a document's category. In this paper, Naive Bayes (NB), a statistical machine learning algorithm, is used to classify Tamil documents into one of a set of predefined categories. Experiments are conducted to evaluate the Naive Bayes categorizer, using a data set of 50 documents per category. The experimental results show that the Naive Bayes classifier performs well, achieving 89.8% accuracy.

Keywords

Document Categorization, Naive Bayes, Stopwords, Preprocessing, Classifier.
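
As an illustration of the train-then-classify workflow the abstract describes, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the authors' implementation: the stopword list, the whitespace tokenizer, the category names, and the four toy training documents are all assumptions invented for this example, and the smoothing scheme is a common default rather than something stated in the abstract.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list; the paper's actual list is not given in the abstract.
STOPWORDS = {"மற்றும்", "ஒரு", "இந்த"}


def tokenize(text):
    """Whitespace tokenization plus stopword removal (a simplifying assumption)."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Return the class c that maximizes log P(c) + sum over words of log P(w | c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data with hypothetical category labels.
TRAINING = [
    ("அணி பந்து வெற்றி", "sports"),
    ("இந்த அணி விளையாட்டு வெற்றி", "sports"),
    ("நடிகர் மற்றும் இயக்குனர் திரைப்படம்", "cinema"),
    ("ஒரு திரைப்படம் நடிகர்", "cinema"),
]

model = train(TRAINING)
predicted = classify("அணி வெற்றி", model)
print(predicted)
```

Scores are summed in log space so that long documents do not underflow to zero. A realistic Tamil pipeline would likely need more preprocessing than whitespace splitting, for example morphological normalization, but the abstract does not say which steps the authors applied.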

Authors

S. Kohilavani
Anna University, Chennai, India
T. Mala
Anna University, Chennai, India
T. V. Geetha
Anna University, Chennai, India
