As the number of documents on the internet has grown, document clustering has become practically important, which has led to growing interest in developing document clustering algorithms. Exploiting parallelism plays an important role in achieving fast, high-quality clustering. In this paper, we propose a parallel algorithm that adopts a hierarchical document clustering approach. Our focus is to exploit the available sources of parallelism to improve performance and reduce clustering time. The proposed parallel algorithm is tested on a test-bed collection of 749 documents from CACM, using a message-passing multiprocessor system. Several parameters are considered in evaluating performance, including average inter-cluster similarity, speedup, and processor utilization. Simulation results show that the proposed algorithm improves performance, reduces clustering time, and increases overall speedup while maintaining high clustering quality. As the number of processors increases, the clustering time decreases up to a certain point, beyond which additional processors are no longer effective. Moreover, the algorithm is applicable to other document collections in different domains.
Keywords
Hierarchical Clustering, Parallel Algorithms, Simulation, Document Collection, Performance Evaluation.
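For readers unfamiliar with the performance measures named in the abstract, the following is a minimal sketch of how speedup and a utilization-style measure are commonly computed. It assumes the textbook definitions (speedup as serial time divided by parallel time, and efficiency as speedup per processor, used here as a stand-in for utilization); the paper may define processor utilization differently. The function names and the timing values are hypothetical and are not results from the paper.

```python
def speedup(t_serial, t_parallel):
    """Textbook speedup: serial run time divided by parallel run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_processors):
    """Speedup per processor, a common proxy for processor utilization.

    Assumption: the paper's own utilization measure may differ.
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return speedup(t_serial, t_parallel) / num_processors


if __name__ == "__main__":
    # Hypothetical clustering times in seconds, keyed by processor count.
    # These numbers are illustrative only, not measurements from the paper.
    timings = {1: 100.0, 2: 55.0, 4: 32.0, 8: 25.0, 16: 24.0}
    t1 = timings[1]
    for p, tp in sorted(timings.items()):
        print(f"p={p:2d}  speedup={speedup(t1, tp):.2f}  "
              f"efficiency={efficiency(t1, tp, p):.2f}")
```

With timings shaped like these, speedup flattens while efficiency keeps falling as processors are added, which is the kind of saturation behavior the abstract describes.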