
Information Retrieval Based on Cluster Analysis Approach


Affiliations
1 Department of Information Technology, The University of Jordan, Aqaba, Jordan
 

The huge volume of text documents available on the internet has made it difficult for users to find the information that is valuable to them, so efficient applications for extracting knowledge of interest from textual documents are vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed, with the aim of designing and developing a system for analysing and extracting useful patterns from text documents. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in the data set. A Vector Space Model (VSM) is then used to represent the data set. The system was implemented in two main phases. In phase 1, a cluster analysis process groups the documents into several clusters. In phase 2, an information retrieval process ranks the clusters according to the user query and retrieves the relevant documents from the clusters deemed relevant to that query. The retrieved results were evaluated using recall and precision at cut-offs 5 and 10 (P@5, P@10): P@5 was 0.660 and P@10 was 0.655.

Keywords

Cluster Analysis, Documents Analysis, Information Retrieval, Text Mining.
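
The two-phase pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the term weighting, the clustering algorithm, or the seeding strategy, so everything below is an assumption made for illustration: TF-IDF weights for the VSM, a small k-means variant that assigns documents by cosine similarity, centroids seeded with the first k documents, and a four-document toy corpus. The pattern-mining pre-processing step is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    # Smoothed inverse document frequency (an assumed weighting scheme).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf):
    # Sparse TF-IDF vector; terms unseen in the corpus are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, iterations=10):
    # Phase 1: group documents. Seeding with the first k vectors is an
    # assumption that keeps the sketch deterministic.
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids

def retrieve(query, vectors, labels, centroids, idf, top_clusters=1):
    # Phase 2: rank clusters against the query, then rank the documents
    # inside the best-matching cluster(s).
    q = vectorize(query, idf)
    ranked = sorted(range(len(centroids)),
                    key=lambda c: cosine(q, centroids[c]), reverse=True)
    chosen = set(ranked[:top_clusters])
    candidates = [i for i, lab in enumerate(labels) if lab in chosen]
    return sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)

def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k

# Toy corpus (hypothetical data, not from the paper).
docs = [
    "clustering groups similar documents",
    "stock market prices fell sharply",
    "document clustering uses similarity",
    "investors watch stock prices",
]
idf = build_idf(docs)
vectors = [vectorize(d, idf) for d in docs]
labels, centroids = kmeans(vectors, k=2)
results = retrieve("stock prices", vectors, labels, centroids, idf)
print(labels, results, precision_at_k(results, {1, 3}, k=2))
```

Searching only the top-ranked cluster keeps the candidate set small, at the cost of missing relevant documents that were assigned to another cluster; the `top_clusters` parameter in this sketch trades one against the other.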



Authors

Orabe Almanaseer
Department of Information Technology, The University of Jordan, Aqaba, Jordan

