
Effect of Stop Word Removal on Document Similarity for Hindi Text


Affiliations
1 Haryana College of Technology and Management, Kaithal, India
2 Punjabi University, Patiala, India
 

Stop word removal is an important preprocessing technique in NLP. Stop words occur very frequently in any document. In this paper, we create a list of stop words for Hindi text based on the frequency of words in documents. Hindi documents from the EMILLE corpus, in UTF-8 encoding, are used to identify the stop words. The percentage of stop words in a document is computed and analyzed experimentally. The paper also discusses the effect of stop word removal on the similarity of two documents containing Hindi text. The Hoad & Zobel approach is used to measure the similarity of the Hindi documents.
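As an illustration only, the frequency-based idea summarized above might be sketched as follows. The abstract does not give the tokenization rule or the frequency cutoff, so the Devanagari tokenizer, the `top_n` parameter, the function names, and the toy sentences are all assumptions made for this sketch. They are not the authors' implementation and the sentences are not EMILLE data. The Hoad & Zobel similarity measure is not reproduced here.

```python
import re
from collections import Counter

# Assumption: a token is a run of Devanagari characters. The danda and
# double danda (U+0964, U+0965) are punctuation, so they are excluded.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def tokenize(text):
    """Split UTF-8 decoded Hindi text into Devanagari word tokens."""
    return TOKEN_RE.findall(text)


def build_stop_word_list(documents, top_n):
    """Return the top_n most frequent tokens across all documents.

    top_n is a hypothetical cutoff; the abstract does not state one.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(top_n)}


def stop_word_percentage(text, stop_words):
    """Percentage of tokens in text that appear in the stop word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return 100.0 * hits / len(tokens)


def remove_stop_words(text, stop_words):
    """Return the tokens of text with stop words filtered out."""
    return [token for token in tokenize(text) if token not in stop_words]


# Toy sentences invented for this sketch (not taken from the corpus).
docs = ["राम घर में है", "सीता घर में है", "मोहन स्कूल में है"]
stop_words = build_stop_word_list(docs, top_n=2)

print(sorted(stop_words))
print(stop_word_percentage(docs[0], stop_words))
print(remove_stop_words("राम घर में है।", stop_words))
```

With a list like this in hand, any document similarity measure can be computed twice, once on the raw tokens and once on the filtered tokens, to observe the effect of stop word removal.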

Keywords

Stop Words, Removal, Text, Hindi, List, Frequency.

Authors

Urvashi Garg
Haryana College of Technology and Management, Kaithal, India
Vishal Goyal
Punjabi University, Patiala, India
