
Enhanced Semantic Similarity Detection of Program Code Using Siamese Neural Network


Affiliations
1 Kofar Kaura Layout, Katsina State, Nigeria
2 Department of Computer Science, Usmanu Danfodiyo University, Sokoto, Nigeria
3 Ministry of Finance Economic and Development, Damaturu, Yobe State, Nigeria
 

Although various source code plagiarism detection approaches exist, most address only lexical similarity attacks, on the assumption that plagiarism is committed only by students who are not proficient in programming. However, students plagiarize not only because they lack ability but also because of poor time management. Semantic similarity attacks should therefore also be detected and evaluated. This research proposes a source code semantic similarity detection approach that can detect most source code similarities by representing the source code as an Abstract Syntax Tree (AST) and evaluating similarity with a Siamese neural network. Because the AST is a language-dependent feature, the SOCO dataset, which consists of C++ programs, was selected. The proposed strategy was implemented, and an experimental study on the AI-SOCO dataset showed that the proposed similarity measure improved the performance of the recommendation system in precision, recall, and F1 score by 15%, 10%, and 22%, respectively, on the 100,000-sample dataset. Based on this evaluation, we conclude that our approach is more effective than most existing systems for detecting source code plagiarism. Future work could extend the system to detect inter-language source code similarity.
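
The general idea of comparing two programs through their ASTs with a twin (Siamese) encoder can be illustrated with a minimal sketch. It is not the authors' model: it assumes Python's built-in `ast` module as a stand-in for a C++ parser, a hypothetical `NODE_TYPES` feature vocabulary, and a single untrained layer with fixed random weights where a real system would learn the weights from labelled program pairs. The only property it demonstrates is that both inputs pass through the same shared-weight encoder before their embeddings are compared.

```python
import ast
import math
import random
from collections import Counter

# Hypothetical feature vocabulary: structural AST node types only, so that
# renaming identifiers does not change the feature vector.
NODE_TYPES = ["FunctionDef", "For", "While", "If", "Assign",
              "AugAssign", "Return", "BinOp", "Compare", "Call"]


def ast_features(source: str) -> list[float]:
    """Return the relative frequency of each vocabulary node type in the AST."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    total = sum(counts[t] for t in NODE_TYPES) or 1  # avoid division by zero
    return [counts[t] / total for t in NODE_TYPES]


class SiameseEncoder:
    """One dense tanh layer; the same weights encode both programs."""

    def __init__(self, in_dim: int, out_dim: int = 8, seed: int = 0):
        rng = random.Random(seed)  # untrained weights, fixed for repeatability
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def encode(self, x: list[float]) -> list[float]:
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(src_a: str, src_b: str, encoder: SiameseEncoder) -> float:
    """Encode both programs with the shared encoder and compare embeddings."""
    return cosine(encoder.encode(ast_features(src_a)),
                  encoder.encode(ast_features(src_b)))


original = """
def total(n):
    s = 0
    for i in range(n):
        s += i
    return s
"""

# Same structure as `original`, with every identifier renamed.
renamed = """
def accumulate(limit):
    acc = 0
    for k in range(limit):
        acc += k
    return acc
"""

# Structurally different program.
different = """
def collatz(n):
    while n > 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
    return n
"""

encoder = SiameseEncoder(in_dim=len(NODE_TYPES))
print("renamed copy:     ", similarity(original, renamed, encoder))
print("different program:", similarity(original, different, encoder))
```

Because the renamed copy has exactly the same node-type frequencies as the original, its embedding is identical and the score is 1.0 up to floating-point rounding, which is the kind of lexical disguise a purely token-based comparison can miss. Node-type frequencies are a deliberately crude feature chosen to keep the sketch short; they discard tree shape and ordering.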

Keywords

Source Code, Lexical plagiarism, Semantic neural network.



Authors

Hadiza Lawal Abba
Kofar Kaura Layout, Katsina State, Nigeria
Abubakar Roko
Department of Computer Science, Usmanu Danfodiyo University, Sokoto, Nigeria
Aminu B. Muhammad
Department of Computer Science, Usmanu Danfodiyo University, Sokoto, Nigeria
Abdulgafar Usman
Ministry of Finance Economic and Development, Damaturu, Yobe State, Nigeria
Abba Almu
Department of Computer Science, Usmanu Danfodiyo University, Sokoto, Nigeria
