A Survey on Pre-Processing Techniques for Text Mining

Manthan J. Vyas; Sanjay D. Bhanderi

Affiliations
1 Marwadi Education Foundation Group of Institutions, Gujarat Technological University, Ahmedabad, Gujarat, India
2 Department of Computer Engineering, Marwadi Education Foundation Group of Institutions, Gujarat Technological University, Rajkot, India

Text mining is the process of extracting interesting patterns or knowledge from unstructured text documents. Text is the most common type of data on the World Wide Web. A text mining framework has two components: text refining and knowledge distillation. Pre-processing is a very important phase in the text mining process. This paper surveys pre-processing for text mining in English and Gujarati. Very little work has been done on text mining for Gujarati. The task is challenging because Gujarati is morphologically rich, which gives rise to a very large number of word forms and a correspondingly large feature space. Some pre-processing techniques for Gujarati are introduced in this paper.