Reproducible Environments

by Sean Lopp

Great data science work should be reproducible. The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications. Team collaboration, project validation, and sustainable products presuppose the ability to reproduce work over time.

In my opinion, mastering just a handful of important tools will make reproducible work in R much easier for data scientists. R users should be familiar with version control, RStudio projects, and literate programming through R Markdown. Once these tools are mastered, the major remaining challenge is creating a reproducible environment.

An environment consists of all the dependencies required to enable your code to run correctly. This includes R itself, R packages, and system dependencies. As with many programming languages, it can be challenging to manage reproducible R environments. Common issues include:

• Code that used to run no longer runs, even though the code has not changed.
• Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
• Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.
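
When one of these issues appears, a useful first step is to look at what the current environment actually contains. The following is a minimal sketch in base R; the helper name describe_environment is made up for illustration and is not part of any package.

```r
# Hypothetical helper: collect the pieces that make up an R environment.
describe_environment <- function() {
  list(
    # the version of R itself
    r_version = R.version.string,
    # the operating system R is running on
    os = sessionInfo()$running,
    # where install.packages() installs to and library() loads from
    library_paths = .libPaths(),
    # where install.packages() downloads packages from
    repositories = getOption("repos"),
    # every installed package, its version, and the library it lives in
    packages = as.data.frame(
      installed.packages()[, c("Package", "Version", "LibPath"), drop = FALSE],
      stringsAsFactors = FALSE
    )
  )
}

env <- describe_environment()
env$r_version
env$library_paths
```

If install.packages appears to do nothing, comparing library_paths with the LibPath column often shows that the package landed in a different library than the one being loaded.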

These challenges can be addressed through a careful combination of tools and strategies. This post describes two use cases for reproducible environments:

1. Safely upgrading packages
2. Collaborating on a team

Each section below covers a strategy for one use case and the tools needed to implement it. Additional use cases, strategies, and tools are presented at https://environments.rstudio.com. This website is a work in progress, but we look forward to your feedback.

Safely Upgrading Packages

Upgrading packages can be a risky affair. Many experienced R users have seen a package upgrade cause unintended consequences. For example, the upgrade may have broken parts of their current code, or upgrading a package for one project accidentally broke the code in another project. A strategy for safely upgrading packages consists of three steps:

1. Isolate a project
2. Record the current dependencies
3. Upgrade packages

The first step in this strategy ensures one project’s packages and upgrades won’t interfere with any other projects. Isolating projects is accomplished by creating per-project libraries. A tool that makes this easy is the new renv package. Inside of your R project, simply use:

# inside the project directory
renv::init()
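
One way to check that the project is isolated is to look at the first library path R uses. This is only a sketch: the helper uses_project_library is hypothetical, and it assumes the project library sits inside the project folder, which may not hold if library paths have been customized.

```r
# Hypothetical helper: TRUE when the given library path lies inside the
# project directory, which is what a per-project library looks like.
uses_project_library <- function(project = getwd(), lib = .libPaths()[1]) {
  project <- normalizePath(project, winslash = "/", mustWork = FALSE)
  lib <- normalizePath(lib, winslash = "/", mustWork = FALSE)
  startsWith(lib, paste0(project, "/"))
}

# run from the project directory after renv::init()
uses_project_library()
```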

The second step is to record the current dependencies. This step is critical because it creates a safety net. If the package upgrade goes poorly, you’ll be able to revert the changes and return to the record of the working state. Again, the renv package makes this process easy.

# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()
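
The lockfile is a JSON file, so the recorded versions can be inspected directly. Below is a minimal sketch that assumes the jsonlite package is installed and that the lockfile has a top-level Packages entry whose items carry Package and Version fields; the function name lock_versions is made up for illustration.

```r
# Hypothetical helper: list the package versions recorded in a lockfile.
lock_versions <- function(path = "renv.lock") {
  lock <- jsonlite::fromJSON(path, simplifyVector = FALSE)
  pkgs <- lock$Packages
  data.frame(
    package = vapply(pkgs, function(p) p$Package, character(1)),
    version = vapply(pkgs, function(p) p$Version, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# from the project directory, after renv::snapshot():
# lock_versions("renv.lock")
```

Comparing this table before and after an upgrade shows exactly which versions changed.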

With an isolated project and a safety net in place, you can now proceed to upgrade or add new packages, confident that the current working environment can still be reproduced. The pak package can be used to install and upgrade packages in an interactive environment:

# upgrade packages quickly and safely
pak::pkg_install("ggplot2")
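
The three steps can be tied together so that an upgrade is kept only when the project still works. The following is a sketch under assumptions, not a prescribed workflow: try_upgrade and its check argument are hypothetical, and check stands in for whatever test of "the project still works" fits your project.

```r
# Hypothetical helper: upgrade a package, then keep or undo the change.
# `check` is a function that returns TRUE when the project still works.
try_upgrade <- function(pkg, check,
                        install = function(p) pak::pkg_install(p),
                        keep = function() renv::snapshot(prompt = FALSE),
                        rollback = function() renv::restore(prompt = FALSE)) {
  install(pkg)
  # treat an error inside check() the same as a failed check
  ok <- isTRUE(tryCatch(check(), error = function(e) FALSE))
  if (ok) {
    keep()      # record the new working state in renv.lock
  } else {
    rollback()  # reinstall the versions recorded in renv.lock
  }
  invisible(ok)
}

# Example call, assuming a hypothetical script that exercises the project:
# try_upgrade("ggplot2", check = function() {
#   source("analysis.R")
#   TRUE
# })
```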

The safety net provided by the renv package relies on access to older versions of R packages. For public packages, CRAN provides these older versions in the CRAN archive. Organizations can use tools like RStudio Package Manager to make multiple versions of private packages available. The “snapshot and restore” approach can also be used to promote content to production. In fact, this approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R applications to production each day!
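
To illustrate what "access to older versions" means in practice, archived source packages on CRAN follow a predictable URL layout. The helper below is a hypothetical sketch that only builds the URL; note that the current release of a package lives under src/contrib rather than in the archive.

```r
# Hypothetical helper: URL of an archived source package on a CRAN mirror.
cran_archive_url <- function(pkg, version, cran = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz", cran, pkg, pkg, version)
}

cran_archive_url("ggplot2", "3.3.0")

# Inside an renv project, a specific version can also be requested with:
# renv::install("ggplot2@3.3.0")
```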

Team Collaboration

A common challenge on teams is sharing and running code. One strategy that administrators and R users can adopt to facilitate collaboration is shared baselines. The basics of the strategy are simple:

1. Administrators set up a common environment for R users by installing RStudio Server.
2. On the server, administrators install multiple versions of R.
3. Each version of R is tied to a frozen repository using an Rprofile.site file.

By using a frozen repository, either administrators or users can install packages while still being sure that everyone will get the same set of packages. A frozen repository also ensures that adding new packages won’t upgrade other shared packages as a side effect. New packages and upgrades are offered to users over time through the addition of new versions of R.
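
As a sketch of the third step, an Rprofile.site file for one version of R might set the default repository as shown below. The URL and snapshot date are placeholders, not a real repository; substitute the frozen repository your organization uses.

```r
# Rprofile.site for one installed version of R
local({
  repos <- getOption("repos")
  # Placeholder URL (assumption): a repository frozen at a fixed date
  repos["CRAN"] <- "https://packages.example.com/cran/2019-11-01"
  options(repos = repos)
})
```

With this in place, install.packages() in that version of R resolves packages against the same frozen repository for every user on the server.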

Frozen repositories can be created by manually cloning CRAN, accessing a service like MRAN, or utilizing a supported product like RStudio Package Manager.

The prior sections presented specific strategies for creating reproducible environments in two common cases. The same strategy may not be appropriate for every organization, R user, or situation. If you’re a student reporting an error to your professor, capturing your sessionInfo() may be all you need. In contrast, a statistician working on a clinical trial will need a robust framework for recreating their environment. Reproducibility is not binary!
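
For the lightweight end of that spectrum, the session information can be saved to a text file and attached to a bug report. This is a minimal sketch in base R; the function name and the default file name are arbitrary choices.

```r
# Hypothetical helper: write the output of sessionInfo() to a text file.
save_session_info <- function(path = "session-info.txt") {
  writeLines(capture.output(print(sessionInfo())), path)
  invisible(path)
}

# write to a temporary file and show the first few lines
info_file <- save_session_info(tempfile(fileext = ".txt"))
head(readLines(info_file))
```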