n-Gram Character Analysis of English Text on Domain Specific Corpus

Lalit Goyal

Vol 9 (2013)
Pages: 44-48
Published: 2013-12-01

Affiliations
1 Department of Computer Science, DAV College, Jalandhar, India

Abstract

Statistical analysis of a language is a vital part of natural language processing. It refers to a collection of methods used to process large amounts of data and report overall trends. In this paper, frequency analysis of individual characters and word length analysis of English text are performed. Unigram, bigram, trigram and positional analysis of characters is carried out on a domain-specific English corpus from the health domain. Miscellaneous analyses, such as the percentage occurrence of various numbers of distinct words and their coverage of the corpus, are also studied.
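
As an illustration of the kind of counting described in the abstract, the following is a minimal Python sketch, not the author's implementation. The function names (`char_ngrams`, `relative_frequencies`, `word_length_distribution`) and the sample sentence are hypothetical. The sketch also assumes that n-grams are counted within words only (never across a space) and that non-alphabetic characters are dropped; the paper may treat these cases differently.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams inside each word (assumption: none spans a space)."""
    counts = Counter()
    for word in text.lower().split():
        # Assumption: keep alphabetic characters only (drops digits, punctuation).
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw counts to percentages of the total number of n-grams."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * c / total for gram, c in counts.items()}


def word_length_distribution(text):
    """Count how many words have each length, measured in letters."""
    lengths = Counter()
    for word in text.split():
        size = sum(1 for ch in word if ch.isalpha())
        if size:
            lengths[size] += 1
    return lengths


if __name__ == "__main__":
    # Hypothetical toy sample, not taken from the paper's health corpus.
    sample = "the theme"
    print(char_ngrams(sample, 1).most_common())
    print(char_ngrams(sample, 2).most_common())
    print(relative_frequencies(char_ngrams(sample, 1)))
    print(word_length_distribution(sample))
```

Calling `char_ngrams` with `n` set to 1, 2 or 3 gives unigram, bigram and trigram counts respectively. A positional analysis could be sketched along the same lines by keying the counter on the pair of character and index within the word.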