A Survey and Classification of Publicly Available COVID-19 Datasets


Affiliations
1 Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India
2 Project Linked Personnel, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India
3 Professor, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India
 

This study curates a list of authentic, open-access sources of alphanumeric COVID-19 pandemic data. We gathered 74 datasets from 42 sources, including sources from 18 countries. The datasets were located by searching the Kaggle and GitHub repositories as well as Google, and they represent a variety of pandemic-related data. The datasets are categorized by source type (primary or secondary) and by geographical distribution. While analyzing the datasets, we identified a set of classes into which they can be grouped. We present this categorization as a taxonomy and highlight current challenges in collecting and using COVID-19 data. The study will help researchers and data curators identify and classify pandemic data.

Keywords

COVID-19, Classification, Curation, Datasets, Metadata.


Authors

Biswanath Dutta
Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India
Puranjani Das
Project Linked Personnel, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India
Sushmita Mitra
Professor, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India
