
Unsupervised Extractive News Articles Summarization leveraging Statistical, Topic-Modelling and Graph-based Approaches


Affiliations
1 The Assam Kaziranga University, Jorhat 785 006, Assam, India
2 Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
3 Gauhati University, Guwahati 781014, Assam, India
 


Authors

Utpal Barman
The Assam Kaziranga University, Jorhat 785 006, Assam, India
Vishal Barman
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Nawaz Khan Choudhury
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Mustafizur Rahman
Girijananda Chowdhury Institute of Management and Technology, Guwahati 781017, Assam, India
Shikhar Kumar Sarma
Gauhati University, Guwahati 781014, Assam, India

Abstract


Data volumes are large and growing exponentially, so manual summarization is slow, prone to bias, and dependent on linguistic experts. Automatic text summarization addresses these problems by generating succinct summaries without manual effort. Three unsupervised extractive approaches were applied to the benchmark British Broadcasting Corporation (BBC) news articles summarization dataset: a statistical approach, Term Frequency-Inverse Document Frequency (TF-IDF); a topic-modelling approach, Latent Semantic Analysis (LSA); and a graph-based approach, TextRank. The paper explores domain-specific implementations of each approach across the five domains of the dataset, as well as domain-agnostic prospects, and draws insights from both. The generated summaries were evaluated with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) framework using precision, recall, and F-measure. The approaches achieved commendable ROUGE scores and outperformed previous work on the dataset.

Keywords


LSA, NLP, ROUGE, TextRank, TF-IDF
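
As a rough illustration of the statistical approach and the evaluation named in the abstract, the following Python sketch ranks sentences by TF-IDF and computes ROUGE-1 precision, recall, and F-measure. It is not the authors' implementation: the naive sentence splitter and tokenizer, the choice to treat each sentence as a document, the smoothed IDF formula, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n_docs = len(sentences)

    # Document frequency: number of sentences in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Sentence score = sum of tf * idf, with idf = log(N / df) + 1 (assumed smoothing).
        score = sum(
            (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        )
        scores.append(score)

    top = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))


def rouge1(candidate, reference):
    """Unigram-overlap ROUGE-1: returns (precision, recall, f_measure)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Calling `tfidf_summarize(article, n_sentences=3)` returns the three highest-scoring sentences of `article` in their original order, and `rouge1(summary, reference)` compares that output with a reference summary. LSA and TextRank variants would differ mainly in how the sentence scores are computed; the selection and evaluation steps could stay the same.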
