
Analysis of Image Preprocessing Techniques to Improve OCR of Garhwali Text Obtained Using the Hindi Tesseract Model


Affiliations
1 Department of Computer Science, Doon University, India
     



A huge amount of information exists in textbooks, paper documents, newspapers, and other physical forms, and it must be digitized for effective access and long-term availability. Optical Character Recognition (OCR) is an effective way to digitize text. In this study, we use Google’s Tesseract as the OCR tool. Our focus is to improve Tesseract’s accuracy on machine-printed Garhwali documents using image pre-processing techniques, including Super-Resolution (SR), binarization (Otsu and adaptive thresholding), skew correction, morphological operations, and ImageMagick methods. We propose three approaches: two differ only in the binarization method (Otsu versus adaptive thresholding), and the third uses ImageMagick methods for pre-processing. For evaluation, we created a dataset by capturing images from a sample of five Garhwali textbooks using two mobile cameras with different resolutions; two books were captured with a high-resolution camera and the other three with a low-resolution camera. Our experiments showed good results in specific cases: for high-resolution images, Otsu thresholding without Super-Resolution achieved 88.13% accuracy, and for low-resolution images, ImageMagick with Super-Resolution achieved 87.44% accuracy.
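To make the Otsu binarization step concrete, here is a minimal pure-Python sketch of the technique. It is an illustration only, not the authors' implementation (which the abstract does not describe). The function names `otsu_threshold` and `binarize`, and the tiny sample image, are hypothetical. The image is assumed to be 8-bit grayscale, given as a list of rows.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t (0..255) for 8-bit gray levels.

    Pixels <= t form one class and pixels > t the other; t is chosen
    to maximize the between-class variance of the two classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break     # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map pixels above the threshold to 255 (white) and the rest to 0 (black)."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


# Hypothetical 2x3 image: dark "ink" values near 10, bright "paper" near 200.
image = [[10, 12, 200],
         [11, 205, 210]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

Otsu's method picks a single global threshold for the whole image. Adaptive thresholding instead computes a threshold for each local neighborhood.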

Keywords

Optical Character Recognition, Garhwali Language, Devanagari Script, Image Preprocessing, ImageMagick



Authors

Sukhbindra Singh Rawat
Department of Computer Science, Doon University, India
Ashutosh Sharma
Department of Computer Science, Doon University, India
Rachana Gusain
Department of Computer Science, Doon University, India

