An R View into Epidemiology

by Joseph Rickert

If you have been tracking the numbers for the COVID-19 pandemic, you must have looked at dozens of models and tried to make some comparisons. Even under the best of situations it is difficult to compare models, and this is especially true if you don’t have sufficient domain knowledge. Experts tend to leave out assumptions and background material that they know other experts will take for granted. This leaves newcomers pretty much on their own.

It has been my experience that a good way for an R-literate person to begin to acquire knowledge in a new field is to find some appropriate packages, study the vignettes, work through the examples, and read whatever source material they reference. So, this post shows how one might go about finding those packages. I also thought it would be interesting to see what kind of special resources are available to epidemiologists working in R beyond the basic statistical infrastructure and packages for data manipulation and visualization.

Because there is no epidemiology task view, a good place to start is to search CRAN directly. (Note that there are task views on differential equations, spatial statistics, time series and other tools used by epidemiologists, so I confined my search to the basics.)

The two main packages I used to search were pkgsearch, which searches CRAN package data, and dlstats, which retrieves package download information from the RStudio mirror. The pkg_search() function takes a query string as input and returns the packages that match the query along with some basic information, a score, and the number of downloads in the previous month. Finding a string that returns a reasonable list of packages requires some hunting skills, gained by iterating over plausible strings.

library(pkgsearch)
Epi <- pkg_search(query = "epidemiology epidemic", size = 200)

On the day I did the search, the above query returned a list of 98 packages.
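The hunt for a useful query string can be made a little more systematic by counting how many packages each candidate string returns. What follows is only a minimal sketch: `count_matches` is a hypothetical helper (not part of pkgsearch), the extra query strings are illustrative, and the search function is passed in as an argument so the helper can be tried without a network connection. Counts are capped at `size`.

```r
# Hypothetical helper: number of packages returned for each query string.
# `search_fn` defaults to pkgsearch::pkg_search; any function that takes
# `query` and `size` and returns a data frame will do.
count_matches <- function(queries, search_fn = pkgsearch::pkg_search, size = 200) {
  vapply(
    queries,
    function(q) nrow(search_fn(query = q, size = size)),
    integer(1)
  )
}

# Candidate strings; only the first one appears in this post
queries <- c("epidemiology epidemic", "epidemiology", "infectious disease")

# Query CRAN only in an interactive session with pkgsearch installed
if (interactive() && requireNamespace("pkgsearch", quietly = TRUE)) {
  print(count_matches(queries))
}
```

A string that returns only a handful of packages is probably too narrow, and one that hits the `size` cap is probably too broad.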

The score column, a measure of how well a package matches the query, and downloads_last_month, a rough proxy for quality, can then help filter the results down to a short list of packages to examine.

library(dplyr)

# Keep well-matched, frequently downloaded packages, most downloaded first
df <- as_tibble(Epi) %>%
  filter(score >= 10, downloads_last_month > 830) %>%
  select(package, title, downloads_last_month) %>%
  arrange(desc(downloads_last_month))

print(df, n = nrow(df))
# A tibble: 23 x 3
   package      title                                         downloads_last_mo…
   <chr>        <chr>                                                      <int>
 1 epitools     "Epidemiology Tools"                                        8480
 2 Epi          "A Package for Statistical Analysis in Epide…               8212
 3 epiR         "Tools for the Analysis of Epidemiological D…               5888
 4 EpiEstim     "Estimate Time Varying Reproduction Numbers …               5477
 5 epiDisplay   "Epidemiological Data Display Package"                      4707
 6 table1       "Tables of Descriptive Statistics in HTML"                  3640
 7 EpiModel     "Mathematical Modeling of Infectious Disease…               3502
 8 haplo.stats  "Statistical Analysis of Haplotypes with Tra…               2723
 9 SpatialEpi   "Methods and Data for Spatial Epidemiology"                 2502
10 R0           "Estimation of R0 and Real-Time Reproduction…               2468
11 popEpi       "Functions for Epidemiological Analysis usin…               2081
12 epitrix      "Small Helpers and Tricks for Epidemics Anal…               1993
13 surveillance "Temporal and Spatio-Temporal Modeling and M…               1986
14 EpiContactT… "Epidemiological Tool for Contact Tracing"                  1181
15 EpiCurve     "Plot an Epidemic Curve"                                    1173
16 epibasix     "Elementary Epidemiological Functions for Ep…               1126
17 epicontacts  "Handling, Visualisation and Analysis of Epi…               1030
18 pubh         "A Toolbox for Public Health and Epidemiolog…                997
19 powerSurvEpi "Power and Sample Size Calculation for Survi…                963
20 mem          "The Moving Epidemic Method"                                 867
21 epimdr       "Functions and Data for \"Epidemics: Models …                867
22 DSAIDE       "Dynamical Systems Approach to Infectious Di…                861
23 episensr     "Basic Sensitivity Analysis of Epidemiologic…                855
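
The cutoffs used above (a score of at least 10 and more than 830 downloads) are judgment calls, so it can be worth checking how sensitive the short list is to them. Here is a rough sketch in base R, assuming a data frame with `score` and `downloads_last_month` columns like the one returned by pkg_search(). `survivors` is a hypothetical helper, and the toy values are made up for illustration.

```r
# Hypothetical helper: how many packages pass each combination of cutoffs.
survivors <- function(pkgs, score_min, downloads_min) {
  grid <- expand.grid(score_min = score_min, downloads_min = downloads_min)
  grid$n_packages <- mapply(
    function(s, d) sum(pkgs$score >= s & pkgs$downloads_last_month > d),
    grid$score_min, grid$downloads_min
  )
  grid
}

# Toy input with made-up scores and download counts
toy <- data.frame(
  score                = c(5, 12, 30),
  downloads_last_month = c(100, 900, 5000)
)
survivors(toy, score_min = c(10, 20), downloads_min = 830)
```

With the real search result, the same call would be something like `survivors(Epi, c(5, 10, 20), c(500, 830, 2000))`.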

Most of the packages in the short list turned out to be “professional” packages in the sense that they provide essential functions but are rather light on documentation. These are targeted towards working professionals. So, while most of the packages found are for the experts, my search did turn up a few for self study. The package epimdr, for example, is associated with Bjornstad’s book Epidemics: Models and Data in R as well as the Coursera course Epidemics - the Dynamics of Infectious Diseases. And the vignette for the epiR package references the free CDC online course Principles of Epidemiology in Public Health Practice.

Six of the packages on the short list (DSAIDE, epicontacts, EpiEstim, EpiModel, epitrix, and surveillance) have all either been developed or authorized by the R Epidemics Consortium (RECON), an international non-profit organization with a mission to “create the next generation of analytics tools for informing the response to disease outbreaks, health emergencies and humanitarian crises, using the R software and other free, open-source resources”. This group not only develops software and builds models, but its members also go onsite to help fight disease outbreaks. These packages are mostly very well documented and useful to experts and students alike.

The DSAIDE package provides a tutorial on infectious diseases.

epicontacts provides a collection of tools for representing epidemiological contact data.

EpiEstim is targeted towards estimating time varying reproduction numbers from epidemic curves.

The EpiModel package, which is documented with a JSS paper and its own tutorial website, provides a number of advanced epidemiological models, including deterministic compartmental models, stochastic individual contact models, and network models that go beyond the simple assumption of random contact among all members in a compartment. EpiModel was featured in Tim Churches’ March post.

The epitrix package contains a number of utility functions for infectious disease modeling, including a function to anonymize data.

The surveillance package, which supports spatio-temporal analysis, is well documented with seven vignettes.

Finally, taking a look at package download history indicates which packages continue to be useful over time, and in this case, provides some idea of the demand for infectious disease modeling. Here is the download history of the top five packages on the short list.

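library(dplyr)
library(dlstats)
library(ggplot2)
library(plotly)

# Get download history for top 5
top_5 <- df %>% slice(1:5)
dl_stats <- cran_stats(top_5$package)

# Plot monthly downloads, one line per package
p <- ggplot(dl_stats, aes(end, downloads, colour = package)) +
  geom_line() +
  xlab("Month") +
  ggtitle("Download History of Top 5 Epidemiology Packages")

# Convert to an interactive plotly figure
fig <- ggplotly(p)
fig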
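
Monthly curves can be noisy, so a yearly summary is sometimes easier to compare across packages. The following is a sketch in base R, assuming the download data has the `end`, `downloads`, and `package` columns used in the plot above, with `end` stored as a Date. `yearly_totals` is a hypothetical helper, and the toy numbers are made up.

```r
# Hypothetical helper: total downloads per package and calendar year.
# Assumes columns `end` (Date), `downloads` (numeric) and `package`.
yearly_totals <- function(dl) {
  dl$year <- as.integer(format(dl$end, "%Y"))
  out <- aggregate(downloads ~ package + year, data = dl, FUN = sum)
  out[order(out$package, out$year), ]
}

# Toy input with made-up numbers, shaped like the data plotted above
toy <- data.frame(
  end       = as.Date(c("2019-11-30", "2019-12-31", "2020-01-31", "2020-01-31")),
  downloads = c(100, 150, 400, 250),
  package   = c("pkgA", "pkgA", "pkgA", "pkgB")
)
yearly_totals(toy)
```

With the real data, the call would be `yearly_totals(dl_stats)`. Keep in mind that the most recent year is a partial one.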

I’ll close with this: if you are R literate, you can be pretty confident that you will be able to find tutorials, models and reference implementations to help you learn something about any field that benefits from statistical analysis. If you are an expert in the field, there will be something for you too.

When looking for R packages for a particular application, first look to see if there is a task view. If not, R provides some pretty good tools to help you search.

