Linux on R Views

Setting up RStudio Server on a Cloud for Collaboration and Reproducibility

Wed, 17 Apr 2019 00:00:00 +0000

Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn.

When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an “opinionated” solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported. We will keep the motivation for good versioning and reproducibility short: R projects evolve over time, as do the packages that they rely on. R projects that do not control package versions will eventually break and/or not be shareable or reproducible¹.

Since R is a slowly evolving language, it might be reasonable to require that a particular Linux instance have only one version of R installed. However, requiring all R users to use the same versions of all packages to facilitate collaboration is clearly out of the question. The solution is to control package versions at the project level.

We use packrat to control package versions. Already integrated with RStudio Server, packrat ensures that all installed packages are stored with the project², and that these packages are available when a project is opened. With packrat, we know that project A will always be able to use ggplot2 2.5.0 and project B will always be able to use ggplot2 3.1.0. This is important if we want to be able to reproduce results in the future.

On Linux, packrat stores compiled packages in packrat/lib/<LINUX_FLAVOR>/<R_VERSION>, an R-version-specific path, relative to the project’s base directory. An issue arises if we are using R version 3.5.0 one week and then upgrade to R 3.5.1 the next week: a packrat project will not find the 3.5.0 libraries anymore, and we will need to rebuild all the packages to install them in the 3.5.1 path. packrat will automatically build all packages from source (sources are stored in packrat/src) if it notices they are missing. However, this process can take tens of minutes, depending on the number of packages being built. Since this can be cumbersome when collaborating, we also opt to include the packrat/lib path in version control, thereby committing the compiled libraries as well.
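The path layout described above can be checked from the shell. The following is a minimal sketch, assuming the library directory follows packrat/lib/<LINUX_FLAVOR>/<R_VERSION>; the function names, the throwaway project tree, and the platform string x86_64-pc-linux-gnu are illustrative assumptions, not part of packrat itself.

```shell
#!/usr/bin/env bash
# Sketch: report whether a packrat project already holds compiled packages
# for a given R version. Function names and the demo tree are hypothetical.

# Print the library directory for a project, platform string, and R version.
packrat_lib_dir() {
  local project="$1" flavor="$2" r_version="$3"
  printf '%s/packrat/lib/%s/%s\n' "$project" "$flavor" "$r_version"
}

# Succeed if that directory exists and contains at least one entry.
has_compiled_libs() {
  local dir
  dir="$(packrat_lib_dir "$1" "$2" "$3")"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]
}

# Demo on a throwaway project tree that only has libraries built for R 3.5.0.
demo="$(mktemp -d)"
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/packrat/lib/x86_64-pc-linux-gnu/3.5.0/ggplot2"

for v in 3.5.0 3.5.1; do
  if has_compiled_libs "$demo" x86_64-pc-linux-gnu "$v"; then
    echo "R $v: compiled libraries found"
  else
    echo "R $v: no compiled libraries, packrat would rebuild from packrat/src"
  fi
done
```

A check like this makes it easy to see, before opening a project, whether an R upgrade will trigger a lengthy rebuild.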

Our solution is to bind one fixed R version to an instance³ and release fixed-R instance images periodically. We prefer limited, consistent R-versions over continually upgrading to the most recent version of R. This approach helps to ensure reproducibility, makes collaboration easier, and avoids having to use docker containers⁴. While binding a fixed version of R to an instance may seem restrictive, we have found that it is in fact quite liberating. Since we only update the existing R version infrequently (think once a year), the barrier of agreeing on an R-version is removed and with it any need to agree on package versions at the user level. Instead, packages are distributed with the project via git. The benefits of fixing the R version for a particular instance are:

Sharing packrat projects and reproducing results are both made easier, since pre-compiled libraries are included with the projects.
Fixing the R-version on an instance doesn’t keep us from upgrading R for a project, as packrat will automatically build and install libraries if an upgraded version is detected. In this way, a project can be opened on an instance with an upgraded R version and have its libraries compiled. Our limited instance image release schedule means the overhead to handle this only occurs at a maximum of once each year.
It is very unlikely that results will differ across R-versions. However, being able to tie project results to one R-version allows us to upgrade R for a project while ensuring that results remain as expected.

What we lose by not being on the bleeding edge of (thankfully relatively non-critical) bug fixes we gain in ease of collaboration. Here’s what we’ve done to accomplish this:

rstudio-instance contains branches with scripts to set up a Linux instance with fixed R and RStudio versions. We git clone the repo and git checkout the branch suitable for the Linux flavor, R-version, and RStudio version we want. The scripts also ensure R is not auto-updated in the future.
We then run the install script to set up the instance and archive an image of it for future use.
Once the fixed-R instance is set up, rstudio-project contains an R-version specific base project with pre-built, packrat-managed, fixed-versions of many popular data-science packages⁵.
We git clone rstudio-project to a new project directory locally and remove the existing .git directory so that it can be turned into a new git repo with git init.
We open the project in RStudio and begin work. All packages are pre-built, so we don’t have to go through lengthy installs. We can upgrade packages in the packrat library of the “Packages” tab, and then run packrat::snapshot() to save any libraries and upgrades into the project’s packrat/ directory. We can then git add packrat to add any packrat updates to the project’s git repo.
If we ever need to duplicate results, we can always build the same fixed-R instance (or clone the image we stored earlier), clone the project on the instance, and know that it will work exactly the same as when we previously worked on it… sometimes years earlier.

Here is a quick example script showing the workflow:

git clone git@github.com:ras44/rstudio-instance.git
cd rstudio-instance
git checkout centos7_R3.5.0_RSS1.1.453
./install.sh
sudo passwd <USERNAME> # set user password for RStudio Server login
cd
git clone git@github.com:ras44/rstudio-project.git new-project
cd new-project
git checkout dev-linux-centos7-R3.5.0
rm -rf .git
git init
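Once work is under way, the snapshot-and-stage step from the list above could be scripted along these lines. This is only a sketch: the run helper, the DRY_RUN variable, and the commit message are assumptions added for illustration, and the script prints the commands by default rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: snapshot the packrat library and stage it in git.
# With DRY_RUN=1 (the default here) commands are only printed.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Intended to be run from the project's base directory.
snapshot_and_stage() {
  run Rscript -e 'packrat::snapshot()'
  run git add packrat
  run git commit -m "Update packrat snapshot"   # commit step is an assumption
}

snapshot_and_stage
```

Setting DRY_RUN=0 runs the commands for real, which requires R, packrat, and git to be available in the project directory.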

Finally, here are some issues with packrat that we have run into along with our solutions. Note that RStudio support has been very helpful in addressing issues while monitoring and providing solutions via their github issue tracker.

If R crashes and the packrat libraries are not accessible after RStudio restarts the session, the project might need to be re-opened. Run .libPaths() to ensure the project library paths are correct. Verify libraries are accessible by looking at the “Packages” tab in RStudio Server and ensuring a “Project Library” header exists with all packages (see above image). Follow issue discussion.
An issue can arise when some packages are updated but others aren’t. This can be challenging to troubleshoot and raises the question of what to do when package versions become incompatible with each other. This is not a packrat problem, but a version-compatibility problem.
Installing packages directly from a private/internal github is evolving. An easy solution exists: simply clone the package to a local directory such as ~/local_repos/. Then use install_local() to install from the local_repos directory. See issue for details.
packrat can occasionally have very slow snapshots, particularly with projects that contain many R-Markdown files and packages. This is likely due to packrat dependency searches. As discussed in the issue, we resolve it by ignoring all of our source directories with packrat::set_opts(ignored.directories=c("all","my","R","src","directories")) and then running packrat::snapshot(ignore.stale=TRUE, infer.dependencies=FALSE).

Unless you somehow exclusively use packages that are never updated, never implement version-breaking/major version updates, or always provide backwards-compatible version upgrades. Many R packages are in major version 0, meaning there is no guarantee that a future release will maintain the same API. ↩
In the packrat/ directory ↩
It is possible to have multiple R versions installed on a system. I have avoided that for simplicity. ↩
docker containers may be a good alternate solution, but in this case we are not using them. ↩
rstudio-project contains all packages in the anaconda distribution and more. ↩

Analytics Administration for R

Wed, 21 Jun 2017 00:00:00 +0000

Analytic administrator is a role that data scientists assume when they onboard new tools, deploy solutions, support existing standards, or train other data scientists. It is a role that works closely with IT to maintain, upgrade, and scale analytic environments. Analytic admins have a multiplier effect - as they go about their work, they influence others in the organization to be more effective. If you are a data scientist using R, you might consider filling the role of analytic admin for your organization.

Consider the data scientist who wants to make R a legitimate part of their organization. This person has to introduce a new technology and help IT build the architecture around it. In this role, the data scientist – acting as an analytic admin – influences their entire organization.

The need for analytic admins

What organizations need analytic admins? Analytic admins are important for any organization that wants to:

Modernize their analytic tools
Take advantage of all their data
Build analytic products and applications
Develop a best-in-class data science team

Despite the fact that the need for analytic admins is pervasive in industry, companies rarely list it as a dedicated role. Instead, they require teamwork between data science and IT operations, or they may require data scientists to function as their own admins. But the need is real. Most organizations need help bridging the gap between data science and IT. If you see an opportunity to function in the capacity of an analytic admin, I suggest you take it.

Analytic admins typically have to train themselves and carve out their own career. It is common for data scientists who operate as analytic admins to feel as though they are in no-man’s land. It is natural to feel lost between the worlds of data science and information technology. As someone who has been there, I can say the feeling is disorienting. However, I can also say the value of that position is tremendous. If you feel like you are operating in no-man’s land as you function in this role, just know you are exactly where you need to be.

R tooling and integration

At RStudio, we think about doing data science as a development process that begins with accessing and understanding your data and ends with communicating your results. This process is thoroughly explained in the book R for Data Science, by Wickham and Grolemund.

RStudio builds open-source and enterprise-ready products to help you do data science in R. These products include the RStudio IDE, RStudio Connect, and Shiny Server. These are designed to work with open-source R packages like Shiny, R Markdown, and the Tidyverse.

Most of the software that RStudio makes is open source, but enterprises often require additional professional features. Common professional features are security, authentication, high availability, administration, and load balancing.

R is also used with production environments for hosting web applications, exposing APIs, and automating workflows. R is sometimes integrated into other systems such as data warehouses, Hadoop, and Spark. The role of the analytic admin is to provide tooling for data scientists, as well as to integrate R into production systems.

Linux and R

RStudio products run on Linux, so understanding Linux will help you become self-sufficient, use R with other systems, and build better solutions. We will talk more about what you can do with Linux commands in an upcoming blog post.

There are many resources for learning Linux online. Here is just one offered by the Linux Foundation. Analytics admins need to know how to navigate (e.g., cd, pwd, ls), install Linux packages (e.g., apt-get install), and execute commands as root (e.g., sudo). Also important are tab completion, keyboard shortcuts, and text editors (e.g., vim, nano).

Did you know you can execute basic Linux commands from inside RStudio Server using the Tools > Shell option? You can also execute Linux commands inside the R console with the system function.

Another major benefit of learning Linux is the ability to administer production systems that run with Shiny Server, and the ability to deploy Shiny web applications into production.

Running Shiny in production

There is a growing trend in using Shiny web apps in production analytic workflows. The vibrant Shiny community now spans all verticals including pharmaceuticals, high technology, and finance. For many organizations, adopting Shiny is their first experience in running R in production.

Production environments that depend on Shiny also need analytic admins who can deploy and support these applications. For example, some organizations now have complex Shiny applications that serve hundreds of end users over a cluster of load-balanced Shiny Servers. These applications often go through a standard development > test > production deployment process. New tools are being built for correctness testing and load testing in Shiny. RStudio and other platform vendors are making significant investments in building architectures - like Shiny Server and RStudio Connect - that will help Shiny grow over the long term.

The growth of Shiny opens an opportunity to analytic admins who want to make analytic content available to a wide audience. Shiny apps allow end users who know nothing about R to take advantage of the power of the R programming language. They have the potential to influence decision-makers who can take actions and see results based on the work data scientists share with them. There is an immediate need for analytic admins who understand Shiny and can help support environments that depend on Shiny.

Getting started: Installing RStudio Server

A great way to get started learning analytics administration is to build your own open source RStudio Server on Linux. Building an RStudio Server by hand is the analytic admin equivalent of the Jedi building their own light sabers. It’s a core skill, so you should be able to do it yourself no matter what.

An easy way to get started with RStudio Server is to set it up on Ubuntu with Amazon Web Services. AWS even has an instruction guide for running R on AWS. The core commands of the install are the following four lines of code (note: this installs RStudio Server version 1.0.143).

$ sudo apt-get install r-base
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
$ sudo gdebi rstudio-server-1.0.143-amd64.deb

Of course, your installation is going to require more than just installing RStudio Server. You will probably want to use the CRAN repository, install Linux dependencies, add users, and manage R packages. Here is a complete script I used to set up RStudio Server on a simple AWS AMI (ami-efd0428f) using a T2-medium instance. I included instructions from this document on how to install R from CRAN. I also opened port 8787 in my AWS security group so I could log into RStudio Server via my web browser.

### Simple RStudio Server Install
### Based on AWS image: ami-efd0428f
 
## Install R from CRAN repository
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
$ sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
$ sudo apt-get update
$ sudo apt-get -y install r-base
 
## Install RStudio Server version 1.0.143
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
$ sudo gdebi rstudio-server-1.0.143-amd64.deb
 
## Add a new user
$ sudo useradd -m myuser
$ sudo passwd myuser
 
## (Optional - may take time) Install common Linux dependencies
$ sudo apt-get -y install libcurl4-openssl-dev openssl libssl-dev
$ sudo apt-get -y install texlive texlive-latex-extra libxml2-dev
 
## (Optional - may take time) Install common R packages
$ sudo Rscript -e 'install.packages("shiny", repos = "http://cran.rstudio.com/")'
$ sudo Rscript -e 'install.packages("tidyverse", repos = "http://cran.rstudio.com/")'
 
## Point your browser to <AWS-instance-IP>:8787
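Before pointing a browser at the instance, it can help to confirm that something is actually listening on port 8787. Here is a small sketch that uses bash's /dev/tcp pseudo-device; the port_open function and the RSTUDIO_HOST variable are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: check whether something is listening on the RStudio Server port.
# Uses bash's /dev/tcp pseudo-device, so no extra packages are needed.

# Succeed if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# RSTUDIO_HOST is a hypothetical variable; set it to your instance IP.
host="${RSTUDIO_HOST:-127.0.0.1}"

if port_open "$host" 8787; then
  echo "Port 8787 on $host is reachable"
else
  echo "Port 8787 on $host is not reachable (check the service and the security group)"
fi
```

If the check fails from your workstation but succeeds on the instance itself, the AWS security group is the likely culprit.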

If you don’t want to install RStudio Server from scratch, there are other ways to get started. One is to use a community AMI like this one. Another is to use the AWS Marketplace to install RStudio Server Pro with 1-Click Launch.

Conclusion

Installation is just the first step to administering R. You should also consider the topics of authentication, security, scale, integration, hardware sizing, and configuration. Systems administrators have to do a lot of their own training, and analytic admins are no different. Fortunately, there are plenty of references to help you get started. Here are a few useful references for learning analytic administration for R, RStudio, and Shiny.

RStudio Products

Authentication and security

Managing R Packages

Shiny Server

R and Singularity

Wed, 29 Mar 2017 00:00:00 +0000

R (https://www.r-project.org) is a premier system for statistical and scientific computing and data science. At its core, R is a very carefully curated high-level interface to low-level numerical libraries. True to this principle, R packages have greatly expanded the scope and number of these interfaces over the years, among them interfaces to a large number of distributed and parallel computing tools. Despite its impressive breadth of sophisticated high-performance computing (HPC) tools, R is not that widely used for “big” problems.

I believe the idiosyncrasies of most HPC technologies represent the major road block to their adoption (in any language or system). HPC technologies are often difficult to set up, use, and manage. They often rely on frequently changing and complex software library dependencies, and sometimes highly specific library versions. Managing all this boils down to spending more time on system administration, and less time on research.

How do we make things easier? One approach to help accelerate the adoption of HPC technology by the R community uses Singularity, a modern application containerization technique suited to HPC (http://singularity.lbl.gov/).

Containers

A container is a collection of the software requirements to run an application. Importantly, containers are defined and generated from a simple text recipe that can be easily communicated and versioned. Containers leverage modern operating system capabilities for virtualizing process and name spaces in a high-performance, low-overhead way. Container technology allows us to quickly turn recipes into runnable applications, and then deploy them anywhere.

The success of Docker, CoreOS, and related systems in enterprise business applications shows that there is a huge demand for lightweight, versionable, and portable containers. Notably, these technologies have not been all that widely successful in HPC settings, despite significant effort. Shifter (https://github.com/NERSC/shifter) is the most successful application of Docker to HPC, and while it is very impressive, it suffers from a few important drawbacks. The root-capable daemon program used by Docker is difficult to accommodate in many HPC environments. And the relatively heavy-weight nature of Docker virtualization can degrade the performance of high-performance hardware resources like Infiniband networking.

Singularity is a lightweight and very simple container technology that is particularly well-suited to HPC environments. Singularity virtualizes the minimum amount necessary to compute, allowing applications full access to fast hardware resources like Infiniband networks and GPUs. And Singularity runs without a server at all, eliminating possible server security exploits. The minimalist philosophy of Singularity makes it easy to install and run on everything from laptops to supercomputers, promoting the ability to quickly test containers before using them across large systems. Singularity is now widely available in supercomputer centers across the world.

Reproducible research

Publishing results with code and data that can be reproduced and validated by others is an obviously important concept that has seen increased urgency these days. The idea is an old one that has been supported by S, S+ and R from the beginning with ideas like Sweave and more recently knitr and R markdown. R even promotes reproducible simulation in distributed/parallel settings by including high-quality, reproducible, distributed random number generators out of the box.

However, as R integrates with an increasing number of external libraries and frameworks like cuDNN, Spark, and others, the ability to reproduce the software environment that R runs in is becoming both more important and more complex. Containers help us define these complex set ups with simple, versionable text files, and then portably run them in diverse environments.

Examples

The following examples assume that Singularity is installed on your system. See http://singularity.lbl.gov/ for details – it’s very easy to install. The examples can be run from nearly any modern Unix operating system, although the processor architecture must be supported by the container operating system.

Hello TensorFlow

The first example below shows a canonical “hello world” program. Instead of a completely trivial example, we print “Hello, TensorFlow!” using TensorFlow from R via Python (https://github.com/tensorflow/tensorflow, https://github.com/python/cpython), introducing a complex but typical software dependency chain. A test program validates operation by printing the “hello world” message from R through TensorFlow. More generally, the container will run any R program named main.R in its working directory.

Here is the Singularity container definition file for the example using the Ubuntu Xenial operating system. (Note that you can build a container from this definition file on any Singularity-supported operating system.)

BootStrap: debootstrap
OSVersion: xenial
MirrorURL: http://archive.ubuntu.com/ubuntu/

%post
  sed -i 's/main/main restricted universe/g' /etc/apt/sources.list
  apt-get update

  # Install R, Python, misc. utilities
  apt-get install -y libopenblas-dev r-base-core libcurl4-openssl-dev libopenmpi-dev openmpi-bin openmpi-common openmpi-doc openssh-client openssh-server libssh-dev wget vim git nano git cmake  gfortran g++ curl wget python autoconf bzip2 libtool libtool-bin python-pip python-dev
  apt-get clean
  locale-gen en_US.UTF-8

  # Install Tensorflow
  pip install tensorflow

  # Install required R packages
  R --slave -e 'install.packages("devtools", repos="https://cloud.r-project.org/")'
  R --slave -e 'devtools::install_github("rstudio/tensorflow")'

%test
  #!/bin/sh
  exec R --slave -e "library(tensorflow); \
                     sess  <- tensorflow::tf\$Session(); \
                     hello <- tensorflow::tf\$constant('Hello, TensorFlow!'); \
                     sess\$run(hello)"


%runscript
  #!/bin/bash
  Rscript --slave "main.R"

TIP If you’re running on Red Hat or CentOS, you’ll need the debootstrap program: sudo yum install debootstrap. See the Singularity documentation for more information.

Assuming that the above definition file is named tensorflow.def, you can bootstrap a Singularity container image named tensorflow.img with:

sudo rm -f tensorflow.img && \
sudo singularity create --size 4000 tensorflow.img && \
sudo singularity bootstrap tensorflow.img tensorflow.def

The %post section of the definition file installs R, Python, TensorFlow, and miscellaneous utilities into the container. The %test section runs the “hello world” program as an example to verify things are working. The %runscript section of this example simply runs an arbitrary user R program named main.R in the container’s working directory.

Run the “hello world” %test script with:

singularity test tensorflow.img

I love Singularity’s ability to include unit tests in container definition files – it reminds me of building R packages! I encourage using the test section judiciously to confirm that the container will work as intended.

You can run an arbitrary R program in the container by creating a main.R file in the container working directory and running:

singularity run tensorflow.img
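As a concrete illustration of that last step, the sketch below writes a one-line main.R and then runs the container only if Singularity and the image are present. The temporary working directory, the IMG variable, and the contents of main.R are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: create a tiny main.R and run it through the container's %runscript.

# Assumed image location; adjust to wherever tensorflow.img was built.
IMG="${IMG:-$(pwd)/tensorflow.img}"

# Stand-in for your real working directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Write a minimal R program; the %runscript calls: Rscript --slave "main.R"
cat > main.R <<'EOF'
cat("Hello from main.R\n")
EOF

if command -v singularity >/dev/null 2>&1 && [ -f "$IMG" ]; then
  singularity run "$IMG"
else
  echo "singularity or $IMG not available; main.R written to $workdir"
fi
```

Swapping in a different main.R is all it takes to run another program with the same container image.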

Full-genome variant Principal Components

The previous example illustrated a complex tool chain, but only running on a single computer. This example is closer to a complete distributed R application.

Genomic variants record differences in a genome relative to a reference. Many types of differences exist; see for instance https://en.wikipedia.org/wiki/Structural_variation. This example focuses on differences among the 2,504 whole human genomes curated by the 1000 Genomes Project (see: “A global reference for human genetic variation”, The 1000 Genomes Project Consortium, Nature 526, 68-74 (01 October 2015) doi:10.1038/nature15393). The example downloads whole genome data files in VCF 4.1 format. Although the 1000 Genomes Project data files are used here, the example will work for any input set of VCF files (it processes all files named *.vcf.gz in the working directory).

The example constructs a sparse 2,504 row (people) by 81,271,844 column (genomic variants) R matrix from the VCF data files. The matrix entries are one if a particular variant occurs in the person, or a zero otherwise. Because not every person exhibits every variant, the matrix is very sparse with about 9.8 billion nonzero-elements, or about 2% fill-in. Rather than construct a single giant sparse matrix, the example partitions the data and saves many smaller sub-matrices each with CHUNKSIZE non-zero elements as R data files in the working directory, where CHUNKSIZE is an optional user-defined parameter that defaults to a value based on system memory size.

The example computes the first NCOMP principal components of the sparse genomic variant data in the VCF files, where NCOMP is an environment variable specified by the user. The example is very general, accepting an arbitrary number of VCF data files as input and running on any number of computers. It uses MPI to coordinate parallel activity across computers, along with the Rmpi, doMPI, and foreach packages in R. The choice of MPI is well-suited to supercomputer deployment. The example assumes that MPI is available and makes the following assumptions:

Launched by MPI
One or more gzip-compressed variant files ending in “.vcf.gz” (the program will use all files matching this pattern)
The input variant files are split up among working directories across the worker computers – each worker will parse and process only the variant files in its local working directory
Optional CHUNKSIZE environment variable in number of variants per chunk
Optional NCOMP environment variable specifying the number of principal components to return, defaulting to 3
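The input conventions in this list can be mimicked in a few lines of shell. This is a sketch of the conventions only, not the logic of the R program used by the container; the function name list_vcf_inputs and the lowercase variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the input conventions listed above (not the container's R code).

# NCOMP defaults to 3 when the environment does not set it.
ncomp="${NCOMP:-3}"

# CHUNKSIZE is optional; an empty value means "let the program choose".
chunksize="${CHUNKSIZE:-}"

# List the gzip-compressed variant files in a directory, one per line.
list_vcf_inputs() {
  local dir="${1:-.}" f
  for f in "$dir"/*.vcf.gz; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] && basename "$f"
  done
  return 0
}

echo "NCOMP=$ncomp CHUNKSIZE=${chunksize:-<auto>}"
echo "Inputs in $(pwd):"
list_vcf_inputs .
```

Running it in each worker's directory before launching MPI is a quick way to confirm which files that worker will pick up.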

A successful run produces the following output:

A file ‘pca.rdata’ in serialized R format containing the largest NCOMP singular values and corresponding principal component vectors of the variant data

This example was designed for deployment with supercomputer systems in mind. See https://github.com/bwlewis/1000_genomes_examples for other implementations that don’t require MPI.

Singularity encapsulates the program logic and the external library dependency chain (MPI, etc.) required by the computation in the following definition file:

BootStrap: debootstrap
OSVersion: xenial
MirrorURL: http://archive.ubuntu.com/ubuntu/
Include: bash

%post
  sed -i 's/main/main restricted universe/g' /etc/apt/sources.list
  apt-get update

  # Install R, openmpi, misc. utilities:
  apt-get install -y libopenblas-dev r-base-core libcurl4-openssl-dev libopenmpi-dev openmpi-bin openmpi-common openmpi-doc openssh-client openssh-server libssh-dev wget vim git nano git cmake  gfortran g++ curl wget python autoconf bzip2 libtool libtool-bin
  apt-get clean

  # Install required R packages
  R --slave -e 'install.packages(c("irlba", "doMPI"), repos="https://cloud.r-project.org/")'

  # Install simple VCF parser helper
  wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/parse.c && cc -O2 parse.c && mv a.out /usr/local/bin/parsevcf && rm parse.c

  # Set up unit test
  mkdir -p /usr/local/share/R
  chmod a+rwx /usr/local/share/R
  wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/unit.R && mv unit.R /usr/local/share/R/

  # This is the main R program run by /singularity
  wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/pca-mpi.R && mv pca-mpi.R /usr/local/share/R/


%test
  #!/bin/sh
  exec Rscript --slave "/usr/local/share/R/unit.R"

%runscript
  #!/bin/bash
  Rscript --slave "/usr/local/share/R/pca-mpi.R"

Build and bootstrap a Singularity container using the variant_pca.def definition file with:

sudo rm -f variant_pca.img && \
sudo singularity create --size 4000 variant_pca.img && \
sudo singularity bootstrap variant_pca.img variant_pca.def

Unit test

The container includes a simple unit test that verifies MPI operation invoked by:

mpirun -np 4 singularity test variant_pca.img

Small example

A small, fast-running example computes principal components for the first 10,000 variants from the 1000 Genomes Project chromosomes 21 and 22 as follows:

wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/extra/chr21.head.vcf.gz
wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/extra/chr22.head.vcf.gz
LANG=C CHUNKSIZE=10000000 mpirun -x LANG -x CHUNKSIZE -np 2 singularity run -H $(pwd) variant_pca.img

Read the output pca.rdata file from R using readRDS(). The following code plots the first three estimated principal components.

x <- readRDS('pca.rdata')
library(lattice)
splom(x$v)

We see some obvious clusters in the data, but the clusters are not all that well-defined because we only use data from two smaller chromosomes (21 and 22) in this example. The clusters correspond to distinct genetic superpopulations. See the following example for a refined plot using the whole genomes.

Full-sized example

Finally, compute the whole genome principal components across all chromosomes and all 2,504 people in the 1000 Genomes project with:

# Remove small example files if they exist
rm -f chr21.head.vcf.gz chr22.head.vcf.gz

# Download the variant files
j=1
while test $j -lt 23; do
  wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${j}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz &
  j=$(( $j + 1 ))
done
wait

When running on more than one computer, first distribute the vcf.gz files by scattering them across working directories on each computer. Each computer will only process the files located in its working directory, so copy a subset of the files to each computer.

The Singularity container image must also be available to run on each computer, so copy the image to each one.
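One way to plan the distribution is a simple round-robin assignment of files to hosts. The sketch below only prints scp commands so the plan can be reviewed first; the plan_scatter function, the /data/vcf directory, and the shortened file names are hypothetical, and the remote directory is assumed to be the working directory later used with mpirun.

```shell
#!/usr/bin/env bash
# Sketch: plan a round-robin distribution of *.vcf.gz files across hosts.
# Prints scp commands instead of running them so the plan can be reviewed.

# Usage: plan_scatter REMOTE_DIR "host1 host2 ..." file...
plan_scatter() {
  local remote_dir="$1" hosts="$2" nhosts i=0 host f
  shift 2
  nhosts=$(echo $hosts | wc -w)
  for f in "$@"; do
    # Pick host number (i mod nhosts) + 1 from the space-separated list.
    host=$(echo $hosts | cut -d' ' -f$(( i % nhosts + 1 )))
    echo "scp $f $host:$remote_dir/"
    i=$(( i + 1 ))
  done
}

# Demo with four hosts and five placeholder file names.
plan_scatter /data/vcf "10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4" \
  chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz chr4.vcf.gz chr5.vcf.gz
```

Piping the output to sh would execute the plan. Chromosome files differ a lot in size, so a size-aware assignment may balance the work better than plain round-robin.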

Now scatter the *.vcf.gz files across your MPI computers, for instance using scp. Let’s assume for this example that we have four total computers. Then we need to invoke the program on 4 + 1 = 5 total MPI hosts, as outlined in https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf (the first listed host will operate as the R master program in a master/slave configuration).

Assume that the hosts are listed in a comma-separated list by the environment variable HOSTS, with the first host repeated so that it runs the master process as well as a worker, for instance by

HOSTS=10.0.0.1,10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4

Then a typical openmpi invocation is (for our four hosts):

LANG=C CHUNKSIZE=10000000 mpirun -wd $(pwd) -x LANG -x CHUNKSIZE -np 5 -host ${HOSTS} singularity run -H $(pwd) variant_pca.img

Replace the host list with the computers available in your cluster, and replace the 5 in -np 5 with the number of those computers plus one. Or, submit the job using an available cluster job manager like Slurm. See https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf for more details on using MPI with R.
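The host list and process count for a different cluster can be derived mechanically. Here is a minimal sketch that repeats the first host for the master process and counts one process more than the number of computers; the helper names make_host_list and mpi_np are hypothetical, and the final line only prints the resulting mpirun command.

```shell
#!/usr/bin/env bash
# Sketch: derive the -host list and -np value from a list of computers.

# Comma-separated host list with the first host listed twice
# (once for the master process, once as a worker).
make_host_list() {
  local list="$1" h
  for h in "$@"; do
    list="$list,$h"
  done
  echo "$list"
}

# Number of MPI processes: one per computer plus one master.
mpi_np() {
  echo $(( $# + 1 ))
}

HOSTS="$(make_host_list 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4)"
NP="$(mpi_np 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4)"

# Print (not run) the resulting command line.
echo "mpirun -wd \$(pwd) -x LANG -x CHUNKSIZE -np $NP -host $HOSTS singularity run -H \$(pwd) variant_pca.img"
```

For the four example addresses this reproduces the five-entry host list and the -np 5 used above.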

Example output

To give you an idea of performance, I ran this example on four Amazon EC2 r4.4xlarge instances. The parsing step completed in about 20 minutes, and principal component computation took about 11 minutes (680 seconds).

As with the small example above, we can read the output file and plot the principal components:

x <- readRDS('pca.rdata')
library(lattice)
splom(x$v)

The resulting clusters are much more highly defined, and split into four or five very well-defined data clusters, corresponding almost exactly to the NIH superpopulation categories for each person. Some of the data clusters themselves exhibit sub-cluster structure.

Additional Notes

The computation uses an R program downloaded from https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/pca-mpi.R that we don’t reproduce here. See that file and https://github.com/bwlewis/1000_genomes_examples/blob/master/PCA_whole_genome.Rmd for additional notes.

The computation proceeds in two sequential phases, first processing the raw VCF files into chunks of sparse R matrices corresponding to the variant data, and then computing principal components on the R matrices. Parallel computation is used within each phase.

Sparse matrix chunk size is specified by the user with the environment variable CHUNKSIZE to indicate the maximum number of nonzero matrix elements per chunk. If unspecified, CHUNKSIZE is automatically determined based on a heuristic using the host computer’s memory size.

The first processing phase of the computation stores the R sparse matrix chunks corresponding to the available input VCF files for re-use iteratively by the algorithm. In particular, this algorithm processes the chunked VCF data out of core – alternative versions of the program pin sparse matrix chunks in memory on each computer and avoid intermediate file system use. That can obviously be more efficient than using a file system. But, importantly, the file system approach scales easily. In particular, this program will run (slowly) on a single laptop even if the total variant sparse matrix size vastly exceeds available RAM size. Thus, this example trades best performance for flexibility. Despite this trade off, performance can be excellent in the example, thanks to the efficient algorithm used and the fact that files are cached in each computer’s buffer cache if memory permits.