
Lossless Text Compression for Unicode Tamil Documents


Affiliations
1 Department of Computer Science, Vidyasagar College of Arts and Science, India
     



Data compression for the world's languages, including Indian languages, is in high demand. Tamil is one of the longest-surviving classical languages in the world. The use of Tamil for communication and storage has increased with the digitization of government documents and orders. The lossless text compression process for Tamil documents substitutes an ASCII character for each Unicode Tamil character, since an ASCII character occupies one byte whereas a Unicode character occupies between 1 and 4 bytes, depending on the encoding used to store the file. The decompression process reverses the compression technique, i.e., it replaces the ASCII characters with the corresponding Unicode characters. This paper describes the architecture of the compression and decompression process for Tamil text documents.

Keywords

Compression, Decompression, Unicode, ASCII and Substitution.
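
The substitution idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual substitution table, so the mapping here is an assumption made for illustration: each code point of the Unicode Tamil block (U+0B80 to U+0BFF, 128 code points) is mapped to one byte in the range 0x80 to 0xFF, and plain ASCII is copied unchanged. Strictly, those byte values lie in the extended single-byte range rather than 7-bit ASCII. The names `compress` and `decompress` are hypothetical, and the sketch assumes the input contains only ASCII and Tamil-block characters. In UTF-8, every Tamil-block character occupies 3 bytes, which is where the saving comes from.

```python
TAMIL_START = 0x0B80  # first code point of the Unicode Tamil block
TAMIL_END = 0x0BFF    # last code point of the Unicode Tamil block


def compress(text: str) -> bytes:
    """Replace each Tamil code point with one byte (assumed mapping)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # plain ASCII is copied unchanged
        elif TAMIL_START <= cp <= TAMIL_END:
            out.append(0x80 + (cp - TAMIL_START))  # Tamil -> 0x80..0xFF
        else:
            # Anything outside ASCII and the Tamil block is out of scope here.
            raise ValueError(f"unsupported character: U+{cp:04X}")
    return bytes(out)


def decompress(data: bytes) -> str:
    """Reverse the substitution: each byte becomes one code point again."""
    return "".join(
        chr(b) if b < 0x80 else chr(TAMIL_START + (b - 0x80)) for b in data
    )


sample = "தமிழ் text"
packed = compress(sample)
assert decompress(packed) == sample  # the round trip is lossless
print(len(sample.encode("utf-8")), "bytes as UTF-8 ->", len(packed), "bytes")
```

Because the mapping is one-to-one in both directions, decompression recovers the original text exactly. The output uses one byte per character, so a document that is mostly Tamil shrinks to roughly a third of its UTF-8 size under these assumptions.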



Authors

B. Vijayalakshmi
Department of Computer Science, Vidyasagar College of Arts and Science, India
N. Sasirekha
Department of Computer Science, Vidyasagar College of Arts and Science, India

