The automated categorization of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describes, a novel method for the automatic induction of rule-based text classifiers. This method supports a hypothesis language of the form "if T1, … or Tn occurs in document d, and none of T1+n,... Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. Issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation were discussed in detail.
Keywords
Data Mining, Text Mining, Clustering, Classification, and Association Rules, Mining Methods and Algorithms.
User
Font Size
Information