Effect of Stop Word Removal on Document Similarity for Hindi Text
Stop word removal is an important preprocessing technique in NLP. Stop words occur very frequently in almost any document. In this paper, we create a list of stop words for Hindi text based on the frequency of words in documents. Hindi documents from the EMILLE corpus, encoded in UTF-8, are used to identify the stop words. The percentage of stop words in each document is determined and analyzed experimentally. The paper discusses the effect of stop word removal on the similarity of two documents containing Hindi text. The Hoad & Zobel approach is used to compute the similarity of the documents.
Keywords
Stop Words, Removal, Text, Hindi, List, Frequency.
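The pipeline described in the abstract (frequency-based stop list, stop word percentage, similarity before and after removal) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy sentences stand in for EMILLE documents, the `top_k` cutoff and the function names are hypothetical, and plain cosine similarity over term counts is used as a stand-in rather than the Hoad & Zobel measure, whose exact formula is not given in this text.

```python
from collections import Counter
import math

# Punctuation to strip before splitting; includes the Devanagari danda.
PUNCTUATION = "।,.?!;:\"'()-"

def tokenize(text):
    """Split UTF-8 Hindi text on whitespace after removing punctuation."""
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    return text.split()

def build_stop_list(documents, top_k):
    """Return the top_k most frequent tokens across all documents.

    top_k is a hypothetical cutoff; the abstract does not state one.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(top_k)}

def stop_word_percentage(text, stop_words):
    """Percentage of tokens in text that belong to the stop list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100.0 * sum(t in stop_words for t in tokens) / len(tokens)

def cosine_similarity(doc_a, doc_b, stop_words=frozenset()):
    """Cosine similarity of term-count vectors, ignoring stop words.

    Stand-in measure only; this is NOT the Hoad & Zobel approach.
    """
    va = Counter(t for t in tokenize(doc_a) if t not in stop_words)
    vb = Counter(t for t in tokenize(doc_b) if t not in stop_words)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Toy documents (assumed examples, not taken from the EMILLE corpus).
docs = [
    "राम ने फल खाया और सीता ने पानी पिया।",
    "मोहन ने किताब पढ़ी और गीता ने गाना गाया।",
]

stop_words = build_stop_list(docs, top_k=2)
share = stop_word_percentage(docs[0], stop_words)
sim_before = cosine_similarity(docs[0], docs[1])
sim_after = cosine_similarity(docs[0], docs[1], stop_words)

print(sorted(stop_words), round(share, 2), sim_before, sim_after)
```

In this toy case the two sentences share only the frequent function words, so their similarity drops once those words are removed. This shows how stop words can inflate a similarity score between otherwise unrelated documents; it is not a reproduction of the paper's results.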