Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script


Affiliations
1 Department of Computer Engineering, Punjabi University, Patiala - 147002, Punjab, India
2 Department of Computer Science, Punjabi University, Patiala - 147002, Punjab, India
 

Abstract

The two most popular writing styles of Urdu are Naskh and Nastaliq. Arabic OCR research has produced an ample body of work on the Naskh style, whereas Urdu, which uses the Arabic character set, is commonly written in the Nastaliq style. Nastaliq poses many distinct challenges for an OCR system trying to recognize an Urdu text image correctly, such as compactness, diagonal orientation, and context-sensitive character shapes. Because of the compact and slanting nature of Nastaliq, existing methods for the Naskh style do not give desirable results. In this paper we therefore present a ligature-based segmentation OCR system for Urdu Nastaliq script. We discuss in detail the unique challenges of Urdu OCR and different feature extraction techniques for ligature recognition using SVM and kNN classifiers. The system is trained to recognize 11,000 Urdu ligatures and achieves an overall accuracy of 90.29% when tested on Urdu text images.

Keywords

Feature Extraction (DCT, Directional, Gabor and Gradient), K-Nearest Neighbor, SVM, Urdu OCR
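
The abstract names DCT features and a k-nearest-neighbor classifier among the techniques used for ligature recognition. As a rough illustration of how those two pieces can fit together, the sketch below computes low-frequency 2D DCT coefficients of a small image and classifies it by nearest neighbor. It is only a minimal sketch and not the authors' pipeline: the function names (`dct2_lowfreq`, `knn_predict`, `bar`), the 8x8 synthetic "bar" images, the unnormalized DCT-II, and the two class labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def dct2_lowfreq(img, k=4):
    """Return the k*k lowest-frequency 2D DCT-II coefficients (unnormalized)
    of a square image given as a list of rows."""
    n = len(img)
    feats = []
    for u in range(k):
        for v in range(k):
            s = 0.0
            for x in range(n):
                cx = math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                for y in range(n):
                    cy = math.cos(math.pi * (2 * y + 1) * v / (2 * n))
                    s += img[x][y] * cx * cy
            feats.append(s)
    return feats


def knn_predict(train, query, k=1):
    """Majority vote among the k training samples closest to `query`
    (Euclidean distance). `train` is a list of (features, label) pairs."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def bar(n, vertical, pos):
    """Hypothetical stand-in for a ligature image: an n*n image with a
    single one-pixel-wide vertical or horizontal stroke at index `pos`."""
    return [[1.0 if (y if vertical else x) == pos else 0.0 for y in range(n)]
            for x in range(n)]


# Toy training set with one sample per (made-up) class.
train = [
    (dct2_lowfreq(bar(8, True, 3)), "vertical"),
    (dct2_lowfreq(bar(8, False, 3)), "horizontal"),
]

# Query strokes shifted by one pixel relative to the training samples.
print(knn_predict(train, dct2_lowfreq(bar(8, True, 4))))
print(knn_predict(train, dct2_lowfreq(bar(8, False, 4))))
```

Keeping only the low-frequency coefficients gives a short feature vector that changes moderately under small shifts of the stroke, which is why the shifted queries still land nearest to the matching class in this toy setup. A real system would work on segmented ligature images and a far larger label set.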
Authors

Ankur Rana
Department of Computer Engineering, Punjabi University, Patiala - 147002, Punjab, India
Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala - 147002, Punjab, India

DOI: https://doi.org/10.17485/ijst%2F2015%2Fv8i35%2F124870