A Novel Approach for Mining Web Documents Based on Bayesian Learning Classifier Systems
Web mining is a new area of data mining. Because the Web is one of the largest repositories of data, using data mining to analyze and explore regularities in Web user behavior can improve system performance and enhance the quality and delivery of Internet information services to the end user. Clustering and classification are active areas of machine learning research that promise to help us cope with the problem of information overload on the Internet. BIRCH is a clustering algorithm designed to operate under the assumption that "the amount of memory available is limited, whereas the dataset can be arbitrarily large". The algorithm generates "a compact dataset summary" while minimizing the I/O cost involved. Noise and uncertainty are also major issues in Web mining. Traditionally, probability is used to measure the uncertainty in a system. The Bayesian approach uses Bayes' theorem to combine existing beliefs with new evidence in order to form new beliefs, and Bayesian inference is regarded in the literature as a robust method for dealing with noise and uncertainty. We therefore propose a modification of UCS, a supervised learning classifier system, that uses a Bayesian update. This method achieves higher accuracy than UCS and requires only half of the learning time to converge. The resulting summary limits the influence of outliers and retains enough information to apply the well-known SMOKA (Smoothened k-means clustering algorithm) to the set of summaries and so generate partitions of the original dataset. We expect the proposed method to run faster because it reduces the time required to explore the search space and find a correct action for a condition.
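The Bayesian update the abstract refers to can be illustrated with a small, generic sketch of Bayes' theorem over a discrete set of hypotheses. This is not the authors' modification of UCS, whose details the abstract does not give. The function name `bayes_update`, the two hypotheses, and all probabilities below are assumptions chosen purely for illustration.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(h | e) for each hypothesis h.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> P(e | h) for the observed evidence e
    """
    # Numerator of Bayes' theorem: P(e | h) * P(h)
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    # Denominator: total probability of the evidence, P(e)
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical scenario: is a classifier rule "reliable" or "unreliable"?
prior = {"reliable": 0.5, "unreliable": 0.5}
# Assumed probability of a correct classification under each hypothesis.
p_correct = {"reliable": 0.9, "unreliable": 0.6}

# Observe three correct classifications in a row; each posterior
# becomes the prior for the next observation.
belief = prior
for _ in range(3):
    belief = bayes_update(belief, p_correct)
```

Each observation shifts belief toward the hypothesis that explains it better, which is the sense in which existing beliefs are combined with new evidence to form new beliefs.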