
Unsupervised Extractive News Articles Summarization leveraging Statistical, Topic-Modelling and Graph-based Approaches


Affiliations
1 The Assam Kaziranga University, Jorhat 785 006, Assam, India
2 Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
3 Gauhati University, Guwahati 781014, Assam, India
 

Because data volumes are large and growing exponentially, manual summarization is slow, prone to bias, and dependent on linguistic experts. Automatic text summarization is therefore important both for avoiding these issues and for generating succinct summaries. Three approaches were applied to generate concise summaries for the benchmark British Broadcasting Corporation (BBC) news article summarization dataset: a statistical approach, Term Frequency-Inverse Document Frequency (TF-IDF); a topic-modelling approach, Latent Semantic Analysis (LSA); and a graph-based approach, TextRank. The paper explores domain-specific implementations of each approach in the five domains of the dataset, as well as their domain-agnostic prospects, and draws insights from both. The generated summaries were evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) framework in terms of precision, recall, and F-measure. The approaches not only achieved commendable ROUGE scores but also outperformed previous work on the dataset.
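To make the statistical approach and the evaluation step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each sentence is treated as a "document" when computing TF-IDF, that a sentence is scored by the mean TF-IDF weight of its distinct terms, and that ROUGE-1 is computed on lowercased unigrams without stemming or stop-word removal. The function names (split_sentences, tokenize, tfidf_summary, rouge_1) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric unigrams; no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_summary(text, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for toks in tokens for term in set(toks))
    scores = []
    for toks in tokens:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        # Sum of tf * idf over the distinct terms of the sentence.
        total = sum((c / len(toks)) * math.log(n / df[t]) for t, c in tf.items())
        # Mean over distinct terms, so long sentences are not favoured by length alone.
        scores.append(total / len(tf))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F-measure (ROUGE-1 style)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

As a usage example, rouge_1("the cat sat", "the cat sat on the mat") gives a precision of 1.0 and a recall of 0.5, because all three candidate unigrams appear in the six-word reference. An LSA or TextRank variant would replace only the scoring step in tfidf_summary; the sentence selection and the evaluation would stay the same.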

Keywords

LSA, NLP, ROUGE, TextRank, TF-IDF



Authors

Utpal Barman
The Assam Kaziranga University, Jorhat 785 006, Assam, India
Vishal Barman
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Nawaz Khan Choudhury
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Mustafizur Rahman
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Shikhar Kumar Sarma
Gauhati University, Guwahati 781014, Assam, India

