In a previous post I described the role of analytic administrator as a data scientist who: onboards new tools, deploys solutions, supports existing standards, and trains other data scientists. In this post I will describe how someone in that role might set up a data science lab for R.
Architecture
A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances.
Key components of this setup include: authentication; load balancing; a testing environment; data connectivity; and a publishing platform. In this server-based architecture, data scientists use a web browser to access the data science lab. High performance compute and live data reside securely behind a firewall.
Instance Sizing
The size of your server instance depends on how many concurrent sessions you run and how large your sessions are. Keep in mind that R is single threaded by default and holds data in memory. Here is a list of example server sizes:
Instance Size | Cores | RAM | Description |
---|---|---|---|
Minimum recommended |
2 |
4G |
This server will be for lightweight jobs, testing, and sandboxing. |
Small |
4 |
8G |
This server will support one or two analysts with small data. |
Large |
16 |
256G |
This server will support 15 analysts with a blend of large and small sessions. Alternatively, it will support dozens of analysts with small sessions. |
Jumbo |
32+ |
1T+ |
May be useful for heavier workloads. |
Open-Source R
If you haven’t done so already, I recommend you make R a legitimate part of your organization by officially recognizing it as an analytic standard. You should be familiar with installing and managing R and its packages.
You can install R as a pre-compiled binary from a repository, or you can install R from source. Installing R from source allows you to install multiple versions of R side by side. If you compile R from source, I recommend you link to the BLAS libraries so that you can speed up certain low-level math computations.
Data science labs tend to require a modern toolkit. You should expect to upgrade R at least once a year. You should also keep your operating system up to date. New and improved R packages tend to work better when you use them with recent versions of R and updated system libraries.
RStudio Server Pro
Building a data science lab involves installing, configuring, and managing tools. In this section I will describe how to administer RStudio Server Pro which has features for authentication, security, and admin controls.
1. Installation
Once you have installed R, you can install RStudio Server Pro by downloading the binaries and following the instructions. You will need root privileges to install and run the software. You will also need to create local system accounts for all of your R developers.
2. Configuration
Authentication. The first thing you will want to do after you install RStudio Server Pro is to configure it with your authentication system. RStudio Server Pro supports LDAP via PAM sessions. If you use single sign on or another system, you can configure RStudio Server Pro to work in proxied auth mode. You can also authenticate via Google accounts and local system accounts.
Data Connectivity. Most data scientists use R with databases. The RStudio Pro Drivers are ODBC drivers that will connect R to some of the most popular databases today. These drivers are a free add-on for RStudio Server Pro. If you are using a data source that is not supported, or if you are using the open source version of RStudio Server, you can bring your own ODBC driver.
Load Balancing. If you want to load balance your server instances, you can use the load balancer that is built into RStudio Server Pro or you can bring your own load balancer. Load balancing is designed to balance user sessions seamlessly across the cluster and provide high availability. It requires a shared home drive that is mounted to each one of the instances.
More Features. RStudio Server Pro has a list of features that you can configure. You should decide which features you want to enable or disable. For more information on configuring each of these features, see the admin guide.
Feature | Description |
---|---|
Authentication |
|
Data Connectivity |
|
Load Balancing |
|
Enhanced security |
|
Administrative dashboard |
|
Auditing and monitoring |
|
Advanced R session management |
|
Project sharing |
|
3. Management
Once RStudio Server Pro is installed and configured, you’ll need to manage it over time. RStudio Server Pro comes with a variety of tools for workspace and server management that will help keep your environment organized. For example, you can kill sessions, set session timeouts, and broadcast notifications to user sessions in real-time. You can manage product licenses for both online and offline environments. If your instances start and stop frequently you can opt for using a floating license manager.
Next Steps
Your data science lab for R should be designed to scale. That might mean adding more people, more systems, or more tools. It also might mean creating more content. Shiny is an R package that makes it easy to build interactive web apps straight from R. R Markdown is an R package that makes it easy to author reports and build dashboards. You can publish your Shiny apps or R Markdown reports with the push of a button to RStudio Connect. RStudio Connect lets you share and manage content in one convenient place. You can also publish Shiny apps to shinyapps.io, which allows you to share your Shiny apps online.
You may leave a comment below or discuss the post in the forum community.rstudio.com.