
Comprehensive Evaluation of Machine Learning Techniques and Novel Features for Web Link Spamdexing Detection


Affiliations
1 Department of Computer Science, Vellalar College for Women, Erode, India
2 Department of Computer Science, KSR College of Arts and Science, Tiruchengode, India
3 Kumaraguru College of Technology, Coimbatore, India
 

The World Wide Web (WWW) is a huge, dynamic, self-organized, and strongly interlinked source of information. Search engines have become a vital information retrieval (IR) system for finding the required information. Results appearing on the first few pages attract more attention and carry more importance, since users believe they are more relevant because of their top positions. Spamdexing plays a key role in giving an undeserving page a high rank and top visibility. This paper focuses on two aspects: new features and new classifiers. First, 27 new features that are used commercially to boost ranking and reputation are considered for classification. Along with them, 17 new features are proposed and computed. In total, 44 features are combined with the existing WEBSPAM-UK 2007 dataset, which serves as the baseline. With all these features, a feature inclusion study is carried out to improve performance. The second aspect considered in this paper is the exploration of a new suite of five different machine learners for the web spam classification problem. Results are discussed. Including the new features improves the classification accuracy obtained with the publicly available WEBSPAM-UK 2007 features by 22%. SVM outperforms the other methods in terms of accuracy.

Keywords

Decision Table, HMM, Search Engine, SVM, Web Spam.
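
The feature inclusion study mentioned in the abstract can be illustrated with a minimal sketch: train the same classifier once on the baseline columns and once on the baseline plus the added columns, then compare accuracy. Everything below is an assumption made for illustration. The data is synthetic (not WEBSPAM-UK 2007), the classifier is a simple nearest-centroid rule standing in for the paper's learners (it is not an SVM), and names such as `make_hosts` and `accuracy` are hypothetical.

```python
import random

def make_hosts(n, rng):
    """Synthetic hosts: columns 0-1 mimic weak 'baseline' features,
    columns 2-3 mimic stronger 'new' features. Label 1 = spam, 0 = non-spam."""
    rows = []
    for i in range(n):
        label = i % 2
        baseline = [rng.gauss(0.5 * label, 1.0) for _ in range(2)]
        new = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        rows.append((baseline + new, label))
    return rows

def fit(train, columns):
    """Compute one centroid per class over the selected feature columns."""
    centroids = {}
    for label in (0, 1):
        vectors = [[x[c] for c in columns] for x, y in train if y == label]
        centroids[label] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids

def predict(centroids, x, columns):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    v = [x[c] for c in columns]
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(v, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(train, test, columns):
    """Fraction of test hosts classified correctly using only `columns`."""
    centroids = fit(train, columns)
    hits = sum(predict(centroids, x, columns) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)            # fixed seed so the run is repeatable
train = make_hosts(400, rng)
test = make_hosts(400, rng)

acc_baseline = accuracy(train, test, [0, 1])        # baseline features only
acc_extended = accuracy(train, test, [0, 1, 2, 3])  # baseline + new features

print(f"baseline accuracy: {acc_baseline:.3f}")
print(f"extended accuracy: {acc_extended:.3f}")
```

The gap between the two printed values depends entirely on how the synthetic columns were generated, so it says nothing about the 22% figure reported in the abstract; the sketch only shows the shape of the comparison.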




Authors

S. K. Jayanthi
Department of Computer Science, Vellalar College for Women, Erode, India
S. Sasikala
Department of Computer Science, KSR College of Arts and Science, Tiruchengode, India
J. P. Vishnupriya
Kumaraguru College of Technology, Coimbatore, India


DOI: https://doi.org/10.15613/sijrs%2F2014%2Fv1i2%2F67548