The COVID19 package, an interface to the COVID-19 Data Hub

2021-12-08

by Emanuele Guidotti

The COVID-19 Data Hub provides a daily summary of COVID-19 cases, deaths, recovered, tests, vaccinations, and hospitalizations for 230+ countries, 760+ regions, and 12000+ lower-level administrative divisions. It includes policy measures, mobility, and geospatial data. This post presents version 3.0.0 of the COVID19 package to seamlessly import the data into R.

Why another package for COVID-19 data

Many packages now exist to retrieve COVID-19 related data from within R. As an example, covid19br retrieves case data for Brazil, covid19sf for San Francisco, covid19us for the United States, covid19india for India, covid19italy for Italy, covid19swiss for Switzerland, covid19france for France, and so on. There are also other packages, such as coronavirus, that retrieve national-level statistics worldwide from the Center for Systems Science and Engineering at Johns Hopkins University (JHU CSSE). However, national counts represent only a small portion of the available governmental data. Having the information scattered across many packages with different interfaces makes international comparisons of large, detailed outbreak data difficult, and hinders effective inference from such data.

COVID19 is the official package created around COVID-19 Data Hub: a unified database harmonizing open governmental data around the globe at fine-grained spatial resolution. Moreover, as epidemiological data alone are typically of limited use, the database includes a set of identifiers to match the epidemiological data with exogenous indicators and geospatial information. By unifying access to the data, this database makes it possible to study the pandemic at its global scale with high resolution, taking into account within-country variations, non-pharmaceutical interventions, and environmental and exogenous variables.

In particular, this package allows you to download a large set of epidemiological variables, exogenous indicators from the World Bank, mobility data from Google and Apple mobility reports, and geospatial information from Eurostat for Europe or GADM worldwide, in, literally, one line of code.

What’s new in version 3.0.0

Version 3 is a major update of COVID-19 Data Hub, which includes a great improvement in the spatial coverage, new data on vaccines, and a new set of identifiers to enable geospatial analyses. The full changelog is available here.

The large amount of data that is now available (~2GB) has led to some breaking changes in the way the data are provided. Version 3 of the COVID19 package is designed to be compatible with the latest version of COVID-19 Data Hub, and to process large amounts of data at speed with low memory requirements. The documentation and a quick start guide are available here.

Data coverage

The figure shows the granularity and the spatial coverage of the data as of November 27, 2021.

What’s included?

library(COVID19)  # load the package
x <- covid19()    # download the data

Refer to the documentation for the description of each variable.

colnames(x)

##  [1] "id"                                  "date"                               
##  [3] "confirmed"                           "deaths"                             
##  [5] "recovered"                           "tests"                              
##  [7] "vaccines"                            "people_vaccinated"                  
##  [9] "people_fully_vaccinated"             "hosp"                               
## [11] "icu"                                 "vent"                               
## [13] "school_closing"                      "workplace_closing"                  
## [15] "cancel_events"                       "gatherings_restrictions"            
## [17] "transport_closing"                   "stay_home_restrictions"             
## [19] "internal_movement_restrictions"      "international_movement_restrictions"
## [21] "information_campaigns"               "testing_policy"                     
## [23] "contact_tracing"                     "facial_coverings"                   
## [25] "vaccination_policy"                  "elderly_people_protection"          
## [27] "government_response_index"           "stringency_index"                   
## [29] "containment_health_index"            "economic_support_index"             
## [31] "administrative_area_level"           "administrative_area_level_1"        
## [33] "administrative_area_level_2"         "administrative_area_level_3"        
## [35] "latitude"                            "longitude"                          
## [37] "population"                          "iso_alpha_3"                        
## [39] "iso_alpha_2"                         "iso_numeric"                        
## [41] "iso_currency"                        "key_local"                          
## [43] "key_google_mobility"                 "key_apple_mobility"                 
## [45] "key_jhu_csse"                        "key_nuts"                           
## [47] "key_gadm"

Data transparency

This package applies no pre-processing to the original data, which are provided as-is. The data acquisition pipeline is open source and all the original data providers are listed here.

As an example, the following code snippet plots the ratio of newly confirmed cases to tests performed on each day in the U.S. Notice that around June 2021 the ratio becomes negative. This is a known issue caused by decreasing cumulative counts reported by the original data provider. This package applies no cleaning procedure for this kind of issue, which is typically due to changes in the data collection methodology. If the provider corrects the data retroactively, the changes are reflected in this package.

library(xts)
library(dygraphs)
x <- covid19("USA", verbose = FALSE)  # download the data
ts <- xts(x[,c("confirmed", "tests")], order.by = x$date)  # convert to an xts object
ts$ratio <- diff(ts$confirmed) / diff(ts$tests)  # compute daily ratio
dygraph(ts$ratio, main = "Daily fraction confirmed/tests in U.S.")  # plot
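To see why a decreasing cumulative count produces a negative daily ratio, here is a minimal sketch on toy data. The values below are made up for illustration and are not taken from any provider; the `suspect` flag is a hypothetical helper column, not something the package adds.

```r
# Toy cumulative counts (illustrative values, not real data).
# The cumulative number of tests is revised downward on the fourth day.
toy <- data.frame(
  date      = as.Date("2021-06-01") + 0:4,
  confirmed = c(100, 110, 125, 130, 142),
  tests     = c(1000, 1100, 1250, 1200, 1320)
)

# Daily increments are first differences of the cumulative series
new_confirmed <- diff(toy$confirmed)
new_tests     <- diff(toy$tests)

# Daily ratio, aligned with the date that closes each interval
ratio <- data.frame(
  date  = toy$date[-1],
  ratio = new_confirmed / new_tests
)

# Flag the days on which either cumulative series decreased
ratio$suspect <- new_confirmed < 0 | new_tests < 0

print(ratio)
```

Flagging such days, rather than silently dropping or clipping them, keeps the analysis consistent with the as-is nature of the data and leaves the choice of correction to the user.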

World Bank data

Country-level covariates from World Bank Open Data can be easily added. Refer to the table at the bottom of this page for relevant indicators. The following code snippet shows, for example, how to download the number of hospital beds for each country. Refer to the quickstart guide for more details.

x <- covid19(wb = c("hosp_beds" = "SH.MED.BEDS.ZS"), verbose = FALSE)
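The World Bank indicator SH.MED.BEDS.ZS is expressed as hospital beds per 1,000 people, so it can be combined with the `population` column to estimate an absolute number of beds. The sketch below uses a toy data frame standing in for the downloaded data; the ids, populations, and bed densities are made-up values, and `hosp_beds_total` is a hypothetical column name.

```r
# Toy rows standing in for the result of the call above (illustrative values)
toy <- data.frame(
  id         = c("A", "B"),
  population = c(60e6, 8.5e6),
  hosp_beds  = c(3.2, 4.6)   # beds per 1,000 people
)

# Convert the per-1,000 density into an estimated absolute number of beds
toy$hosp_beds_total <- toy$hosp_beds * toy$population / 1000

print(toy)
```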

Mobility data

Mobility data are obtained from Google and Apple mobility reports. The following example shows how to download the data by Google. Refer to the quickstart guide for Apple’s reports and for more details.

x <- covid19(gmr = TRUE, verbose = FALSE)
colnames(x[,48:53])

## [1] "retail_and_recreation_percent_change_from_baseline"
## [2] "grocery_and_pharmacy_percent_change_from_baseline" 
## [3] "parks_percent_change_from_baseline"                
## [4] "transit_stations_percent_change_from_baseline"     
## [5] "workplaces_percent_change_from_baseline"           
## [6] "residential_percent_change_from_baseline"

Spatial data

The dataset contains NUTS codes to match the Eurostat database for Europe, and GID codes to match the GADM worldwide database. The following example shows how to access spatial data using GADM for U.S. counties. Similar maps are available for most other countries worldwide at the various granularity levels.

First, download level 3 data for the U.S.

x <- covid19("USA", level = 3, verbose = FALSE)

GADM data by country can be found here. Download the geopackage for the U.S. using GADM version 3.6:

url <- "https://biogeo.ucdavis.edu/data/gadm3.6/gpkg/gadm36_USA_gpkg.zip"
zip <- tempfile()
download.file(url, destfile = zip)

Unzip the geopackage:

exdir <- tempfile()
unzip(zip, exdir = exdir)

Load the sf package and list the layers:

library(sf)
file <- paste0(exdir, "/gadm36_USA.gpkg")
st_layers(file)

## Driver: GPKG 
## Available layers:
##     layer_name geometry_type features fields
## 1 gadm36_USA_2 Multi Polygon     3148     13
## 2 gadm36_USA_1 Multi Polygon       51     10
## 3 gadm36_USA_0 Multi Polygon        1      2

Read layer 2, which corresponds to U.S. counties. Note: in general, there is no perfect correspondence between GADM layers and the granularity levels used by this package. It is recommended to read all the layers and match on the corresponding GID. Read more about how key_gadm from this package is mapped to the corresponding GID here.

g <- st_read(file, layer = "gadm36_USA_2")

## Reading layer `gadm36_USA_2' from data source 
##   `/private/var/folders/w0/skxpg0h51jg72m5b01y_8n_c0000gn/T/RtmpmUy7P6/file56232c611dfd/gadm36_USA.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 3148 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2 ymin: 18.91 xmax: 179.8 ymax: 72.69
## Geodetic CRS:  WGS 84
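The recommendation to read all layers and match on the corresponding GID can be sketched with plain data frames. In the sketch below, the `layers` list stands in for the attribute tables of the GADM layers (geometry omitted), and the GID strings, names, and counts are illustrative assumptions rather than values read from the geopackage. With real data, each element would come from reading one layer of the file.

```r
# Toy attribute tables standing in for GADM layers 0, 1 and 2 (no geometry).
# Each layer carries its own identifier column: GID_0, GID_1, GID_2.
layers <- list(
  data.frame(GID_0 = "USA", name = "Country"),
  data.frame(GID_1 = c("USA.1_1", "USA.2_1"), name = c("State 1", "State 2")),
  data.frame(GID_2 = c("USA.1.1_1", "USA.1.2_1"), name = c("County 1", "County 2"))
)

# Stack all layers into a single lookup table keyed by GID
lookup <- do.call(rbind, lapply(seq_along(layers), function(i) {
  layer <- layers[[i]]
  data.frame(
    gid   = layer[[paste0("GID_", i - 1)]],
    name  = layer$name,
    layer = i - 1
  )
}))

# Toy epidemiological rows whose key_gadm may point to any layer, or to none
x_toy <- data.frame(
  key_gadm  = c("USA.1_1", "USA.1.2_1", "USA.9.9_1"),
  confirmed = c(500, 40, 7)
)

# Left join on the GID; unmatched keys keep NA in the lookup columns
matched <- merge(x_toy, lookup, by.x = "key_gadm", by.y = "gid", all.x = TRUE)

print(matched)
```

Rows with `NA` in the `layer` column have no counterpart in any layer, which makes mismatches between the two sources easy to spot before plotting.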

Subset the data to extract only the counts as of, e.g., 15 November 2021. Select only the administrative divisions inside the following bounding box for better visualization.

x <- x[
  x$date == "2021-11-15" & 
  x$latitude > 24.9493 & x$latitude < 49.5904 &
  x$longitude > -125.0011 & x$longitude < -66.9326,]

Merge COVID-19 data with the spatial data:

library(dplyr)
gx <- right_join(g, x, by = c("GID_2" = "key_gadm"))

Plot, for example, the total number of confirmed cases:

plot(gx["confirmed"], logz = TRUE)

Academic publications

See the publications that use COVID-19 Data Hub.

Cite as

Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

A BibTeX entry for LaTeX users is

@Article{,
    title = {COVID-19 Data Hub},
    year = {2020},
    doi = {10.21105/joss.02376},
    author = {Emanuele Guidotti and David Ardia},
    journal = {Journal of Open Source Software},
    volume = {5},
    number = {51},
    pages = {2376}
}