In a multi-script, multilingual environment, a document may contain text lines in more than one script or language. The script regions of such a document must be identified before each region can be passed to the OCR system for its language. In this context, this paper proposes a model to identify and separate text lines of Telugu, Devanagari and English scripts in a printed trilingual document. The proposed method uses distinctive features extracted from the top and bottom profiles of the printed text lines. The experiments used 1500 text lines for learning and 900 text lines for testing. The resulting script identification performance was 99.67%.
Keywords
Multi-Script Multi-Lingual Document, Script Identification, Feature Extraction.
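To illustrate what the top and bottom profiles of a text line are, the following is a minimal sketch, assuming the line has already been segmented and binarized into a 2D array in which nonzero pixels are ink. The function name `top_bottom_profiles` and the use of -1 for empty columns are illustrative assumptions; the abstract does not specify which features the proposed method derives from these profiles, so none are computed here.

```python
import numpy as np

def top_bottom_profiles(line_img):
    """Return (top, bottom) profiles of a binarized text-line image.

    Hypothetical helper for illustration only.
    line_img: 2D array-like, nonzero entries are ink pixels.
    top[c]    = row index of the first ink pixel in column c (from the top)
    bottom[c] = row index of the last ink pixel in column c
    Columns without ink get -1 in both profiles (an assumed convention).
    """
    ink = np.asarray(line_img) != 0
    height = ink.shape[0]
    has_ink = ink.any(axis=0)
    # argmax returns the first True along the axis; mask out empty columns.
    top = np.where(has_ink, ink.argmax(axis=0), -1)
    # Flip the rows so the first True corresponds to the lowest ink pixel.
    bottom = np.where(has_ink, height - 1 - ink[::-1].argmax(axis=0), -1)
    return top, bottom

# Tiny 4x4 example: column 2 contains no ink.
example = [[0, 1, 0, 0],
           [1, 1, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 0, 1]]
top, bottom = top_bottom_profiles(example)
print(top.tolist())     # [1, 0, -1, 2]
print(bottom.tolist())  # [2, 1, -1, 3]
```

Any script-specific features would then be measured on these two one-dimensional profiles rather than on the full image.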