The world seems to have moved to a new phase of paying attention to COVID-19. We have gone from pondering daily plots of case counts, to puzzling through models and forecasts, and are now moving on to the vaccines and the science behind them. For data scientists, however, the focus needs to remain on the data and the myriad issues and challenges that efforts to collect and curate COVID data have uncovered. My intuition is that not only will COVID-19 data continue to be important for quite some time, but that efforts to improve the quality of this data will be crucial for dealing successfully with the next pandemic.
An incredible amount of work has been done by epidemiologists, universities, government agencies and data journalists to collect, organize, and reconcile data from thousands of sources. Nevertheless, the experts caution that there is much yet to be done.
Roni Rosenfeld, head of the Machine Learning Department of the School of Computer Science at Carnegie Mellon University and project lead for the Delphi Group, put it this way in a recent COPSS-NISS webinar:
Data is a big problem in this pandemic. Availability of high quality, comprehensive, geographically detailed data is very far from where it should be.
There are over 6,000 hospitals in the United States and over 160,000 hospitals worldwide. Many of these are collecting COVID-19 data, yet there are few standards for recording cases, dealing with missing data, updating case count data, and coping with the time lag between recording and reporting cases. Nowcasting epidemiological and health care data, that is, estimating current conditions before complete reports arrive, has become a vital field of statistical research.
The following slide from the COPSS-NISS webinar shows a hierarchy of relevant COVID data organized on the Severity Pyramid that epidemiologists use to study disease progression.
The Delphi Group is making fundamental contributions to the long-term improvement of COVID data by archiving the data shown above in such a way that versions can be retrieved by date, and also by collecting massive data sets of leading indicators.
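To make the versioning concrete, here is a minimal sketch of retrieving an "as of" snapshot with the covidcast R package, one of the clients for the Delphi Epidata API. The data source, signal, county, and dates below are illustrative assumptions, and the as_of argument is used as I understand it from the package documentation, so check the current docs before relying on the details.

```r
# A sketch of comparing the latest data with an earlier "as of" snapshot.
# Data source, signal, county (FIPS 42003), and dates are illustrative assumptions.
library(covidcast)

# Doctor-visit-based COVID-like-illness signal, latest available version
latest <- covidcast_signal(
  data_source = "doctor-visits", signal = "smoothed_adj_cli",
  start_day = "2020-05-01", end_day = "2020-05-15",
  geo_type = "county", geo_values = "42003"
)

# The same signal as it looked on 2020-05-16, before later revisions arrived
snapshot <- covidcast_signal(
  data_source = "doctor-visits", signal = "smoothed_adj_cli",
  start_day = "2020-05-01", end_day = "2020-05-15",
  geo_type = "county", geo_values = "42003",
  as_of = "2020-05-16"
)

# A side-by-side comparison shows how reported values are revised over time
merge(latest[, c("time_value", "value")],
      snapshot[, c("time_value", "value")],
      by = "time_value", suffixes = c("_latest", "_as_of"))
```

Being able to ask for the data as it was known on a given date is exactly what forecast evaluation and nowcasting research require, since a model running in real time only ever sees the unrevised numbers.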
The webinar is well worth watching, and I highly recommend listening through the Q&A session at the end. The speakers explain the importance of nowcasting and Professor Rosenfeld presents a vision of making epidemic forecasting comparable to weather forecasting. It seems to me that this would be a worthwhile project to help advance.
Note that the Delphi Group's COVID-19 indicators, probably the nation's largest public repository of diverse, geographically detailed, real-time indicators of COVID activity in the US, are freely available through a public API that is easily accessible to R and Python users.
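For R users, getting started with the API can be as simple as the sketch below, again using the covidcast package as the client; the signal name and date are assumptions chosen for illustration, and Python users have an analogous covidcast package available.

```r
# A minimal "getting started" sketch against the Delphi COVIDcast API.
# Signal name and date are illustrative assumptions.
library(covidcast)

# Survey-based, smoothed COVID-like-illness signal for all counties on one day
cli <- covidcast_signal(
  data_source = "fb-survey", signal = "smoothed_cli",
  start_day = "2020-05-01", end_day = "2020-05-01",
  geo_type = "county"
)

# One row per county-day: FIPS code, date, estimate, and its standard error
head(cli[, c("geo_value", "time_value", "value", "stderr", "sample_size")])

# Counties with the highest reported CLI estimate on that day
head(cli[order(-cli$value), c("geo_value", "value")], 10)
```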
Also note that R users can contribute to R Consortium-sponsored COVID-related projects, including the COVID-19 Data Hub, an organized archive of global COVID-19 case count data, and the RECON COVID-19 Challenge, an open-source project to improve epidemiological tools.
You may leave a comment below or discuss the post in the forum at community.rstudio.com.