
Efficient Similarity Join Method Using Unsupervised Learning


Affiliations
1 Department of Computer Information Systems, Alzaytoonah University of Jordan, Amman 11733, Jordan
2 Department of Computer Science, Wayne State University, Detroit, MI 48202, United States
3 Department of Computer and Information Science, University of Michigan-Dearborn, Dearborn, MI 48128, United States
 

This paper proposes an efficient similarity join method based on unsupervised learning for cases where no labeled data is available. In our previous work, we showed that the performance of similarity join can improve when long string attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, are used under supervised learning, where a training set exists. In this work, we use long string attributes in the similarity join under unsupervised learning. Besides being necessary when no labeled data exists, unsupervised learning also serves as a quick preprocessing method for huge datasets. Here, we show that using long attributes during unsupervised learning can further enhance performance. Moreover, we provide an efficient, dynamically expandable algorithm for databases with frequent transactions.

Keywords

Similarity Join, Unsupervised Learning, Diffusion Maps, Databases, Machine Learning.

Authors

Bilal Hawashin
Department of Computer Information Systems, Alzaytoonah University of Jordan, Amman 11733, Jordan
Farshad Fotouhi
Department of Computer Science, Wayne State University, Detroit, MI 48202, United States
William Grosky
Department of Computer and Information Science, University of Michigan-Dearborn, Dearborn, MI 48128, United States
