<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>R for the Enterprise on R Views</title>
    <link>https://rviews.rstudio.com/categories/r-for-the-enterprise/</link>
    <description>Recent content in R for the Enterprise on R Views</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 17 Oct 2019 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://rviews.rstudio.com/categories/r-for-the-enterprise/" rel="self" type="application/rss+xml" />
    
    
    
    
    <item>
      <title>Productionizing Shiny and Plumber with Pins</title>
      <link>https://rviews.rstudio.com/2019/10/17/deploying-data-with-pins/</link>
      <pubDate>Thu, 17 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/10/17/deploying-data-with-pins/</guid>
      <description>
        


&lt;p&gt;Producing an API that serves model results or a Shiny app that displays the results of an analysis requires a collection of intermediate datasets and model objects, all of which need to be saved. Depending on the project, they might need to be reused in another project later, shared with a colleague, used to shortcut computationally intensive steps, or safely stored for QA and auditing.&lt;/p&gt;
&lt;p&gt;Some of these &lt;em&gt;should&lt;/em&gt; be saved in a data warehouse, data lake, or database, but write access to an appropriate database isn’t always available. In other cases, especially with models, it may not be clear where they should be saved at all.&lt;/p&gt;
&lt;p&gt;Enter &lt;a href=&#34;https://rstudio.github.io/pins/&#34;&gt;&lt;code&gt;pins&lt;/code&gt;&lt;/a&gt;, a new R package written by &lt;a href=&#34;https://github.com/javierluraschi&#34;&gt;Javier Luraschi&lt;/a&gt;. &lt;code&gt;pins&lt;/code&gt; makes it easy to save (pin) R objects including datasets, models, and plots to a central location (board), and access them easily from both R and Python. Pins make it much easier to create production-ready R assets by simplifying the storage and updating of intermediate data artifacts.&lt;/p&gt;
&lt;div id=&#34;problems-you-can-put-a-pin-in&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Problems you can put a pin in&lt;/h2&gt;
&lt;p&gt;In general, pins are a good substitute for saving objects alongside analysis code as &lt;code&gt;.csv&lt;/code&gt; or &lt;code&gt;.rds&lt;/code&gt; files. Especially when an object is reused several times or updated independently of the rest of the analysis, a pin is probably a better solution than saving a file with your code.&lt;/p&gt;
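&lt;p&gt;As a minimal sketch (the object and pin names here are hypothetical), the pattern replaces writing a file next to your code with pinning the object to a board and retrieving it from any session:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# instead of writing a file alongside the analysis code...
saveRDS(results, &amp;quot;results.rds&amp;quot;)

# ...pin the object once and retrieve it wherever it&amp;#39;s needed
pins::pin(results, name = &amp;quot;analysis_results&amp;quot;)
results &amp;lt;- pins::pin_get(&amp;quot;analysis_results&amp;quot;)&lt;/code&gt;&lt;/pre&gt;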
&lt;p&gt;In this article, I’ll create a predictive model, programmatically serve predictions via a &lt;a href=&#34;https://www.rplumber.io/&#34;&gt;Plumber API&lt;/a&gt;, and visualize those predictions in a Shiny app. Along the way, I’ll make extensive use of pins for important parts of my workflow.&lt;/p&gt;
&lt;p&gt;The model will predict future availability of bicycles at &lt;a href=&#34;https://www.capitalbikeshare.com/&#34;&gt;Capital Bikeshare&lt;/a&gt; docks. Capital Bikeshare provides short-term bicycle rentals in and around Washington, DC, and makes data on the current availability of bikes at each station available via a public API.&lt;/p&gt;
&lt;p&gt;I’m going to make model predictions available in production by providing programmatic access to the model via an API and to humans via a Shiny app. All of the code for this demo is available on &lt;a href=&#34;https://github.com/rstudio/bike_predict/&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To get there, I’m going to follow this analysis workflow:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Ingest metadata about the stations, like name and location, from the bike data API.&lt;/li&gt;
&lt;li&gt;Combine the station metadata with raw data on bike availability from the data lake to create an analysis dataset.&lt;/li&gt;
&lt;li&gt;Train and deploy a model of future bike availability.&lt;/li&gt;
&lt;li&gt;Serve model predictions via a Plumber API.&lt;/li&gt;
&lt;li&gt;Visualize model predictions via a Shiny app.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Along the way, here are three specific times that a pin is going to come in handy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintaining the metadata table of station IDs and details. Since I’m reusing this table in multiple assets in this project, keeping it in a pin ensures it’s always up to date.&lt;/li&gt;
&lt;li&gt;Saving the final analysis dataset. In this case, the raw Capital Bikeshare data is imported with a completely separate ETL script, and I don’t want to write my analysis dataset into the data lake. Without a separate database for analysis data, a pin is my best option.&lt;/li&gt;
&lt;li&gt;Deploying the model to serve the predictions. Saving the model separately from the API makes it easy to decouple API and model versions and to retrain the model and redeploy seamlessly when needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In all of these cases, pins drastically simplify my workflow, improve the discoverability of the objects my analysis creates, and make me more confident that I’m always using the newest version.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;where-to-pin&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Where to pin?&lt;/h2&gt;
&lt;p&gt;Before getting started describing exactly how this analysis project works, let’s dive a little deeper into the &lt;code&gt;pins&lt;/code&gt; package itself.&lt;/p&gt;
&lt;p&gt;Pins live on &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-understanding.html&#34;&gt;boards&lt;/a&gt;. A board is a set of content names and their associated files. The magic of the &lt;code&gt;pins&lt;/code&gt; package is that with only two commands and the name of some content, you can upload and download your R objects without having to worry about how the content is stored.&lt;/p&gt;
&lt;p&gt;By default, there are two boards you can use immediately: the &lt;code&gt;packages&lt;/code&gt; board, which exposes the datasets from installed R packages, and the &lt;code&gt;local&lt;/code&gt; board, which caches datasets for quick loading later.&lt;/p&gt;
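&lt;p&gt;For example, both default boards work without any registration at all:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# cache an object on the local board for quick loading later
pins::pin(mtcars, name = &amp;quot;mtcars&amp;quot;)

# search the packages board for datasets shipped with installed packages
pins::pin_find(&amp;quot;baby names&amp;quot;, board = &amp;quot;packages&amp;quot;)&lt;/code&gt;&lt;/pre&gt;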
&lt;p&gt;The real power of &lt;code&gt;pins&lt;/code&gt; is unlocked with remote boards. &lt;code&gt;pins&lt;/code&gt; supports &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-kaggle.html&#34;&gt;Kaggle&lt;/a&gt;, &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-github.html&#34;&gt;GitHub&lt;/a&gt;, &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-websites.html&#34;&gt;website&lt;/a&gt;, and &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-rsconnect.html&#34;&gt;RStudio Connect&lt;/a&gt; boards, and also supports building &lt;a href=&#34;https://rstudio.github.io/pins/articles/boards-extending.html&#34;&gt;custom extensions&lt;/a&gt;. By using a remote board, you can use &lt;code&gt;pins&lt;/code&gt; to make your R objects accessible to others on your team in a central location.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-it-works&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How it works&lt;/h2&gt;
&lt;p&gt;Using a pin works like this:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Register the board with the &lt;code&gt;pins::board_register&lt;/code&gt; function. If you are using a remote board, you’ll need to provide the proper authentication mechanism, like a &lt;a href=&#34;https://www.kaggle.com/docs/api&#34;&gt;Kaggle token&lt;/a&gt;, &lt;a href=&#34;https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line&#34;&gt;GitHub Personal Access Token (PAT)&lt;/a&gt;, or &lt;a href=&#34;https://docs.rstudio.com/connect/1.5.4/user/api-keys.html&#34;&gt;RStudio Connect API key&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For GitHub, you need a repo that you have write access to, as well as a token:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pins::board_register(board = &amp;quot;github&amp;quot;, 
                     repo = &amp;quot;akgold/pins_demo&amp;quot;, 
                     branch = &amp;quot;master&amp;quot;,
                     token = Sys.getenv(&amp;quot;GITHUB_PAT&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For an RStudio Connect board, you need the server URL and an API key:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pins::board_register(board = &amp;quot;rsconnect&amp;quot;, 
                     server = &amp;quot;https://colorado.rstudio.com/rsc&amp;quot;, 
                     key = Sys.getenv(&amp;quot;RSTUDIOCONNECT_API_KEY&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At that point, your connections pane in RStudio will show the content available in the board.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;pins-connection-pane&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/pins_connection.png&#34; alt=&#34;Pins Connection Pane&#34; /&gt;&lt;/h1&gt;
&lt;p&gt;Once you’ve registered the board, your interactions are exactly the same no matter which board type you’re using.&lt;/p&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Pin an object to the board.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pins::pin(
  x = mtcars, 
  name = &amp;quot;mtcars_pin&amp;quot;, 
  description = &amp;quot;A pin of the mtcars dataset.&amp;quot;, 
  board = &amp;quot;rsconnect&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Download the object later.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cars_data &amp;lt;- pins::pin_get(
  name = &amp;quot;mtcars_pin&amp;quot;,
  board = &amp;quot;rsconnect&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;production-apps-with-pins&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Production Apps with Pins&lt;/h2&gt;
&lt;p&gt;In order to create, serve, and visualize my bike-availability predictions, I’m going to use RStudio’s publishing and scheduling platform, &lt;a href=&#34;https://rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As of RStudio Connect 1.7.8, you can publish pins to RStudio Connect, and pinned datasets get a nice preview, along with code to retrieve the pin in both R and Python.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;a-pin-on-rstudio-connect&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/rsc_pin.png&#34; alt=&#34;A pin on RStudio Connect&#34; /&gt;&lt;/h1&gt;
&lt;p&gt;The advantage of using RStudio Connect is that I can deploy R Markdown documents, Shiny apps, and Plumber APIs that create, use, and update the pins in addition to storing the pins themselves. I can also use the permissions and security of RStudio Connect to make sure that my pins are viewable only by those with the proper permissions.&lt;/p&gt;
&lt;p&gt;Here’s how the process works:&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;system-schematic&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/system_schematic.png&#34; alt=&#34;System Schematic&#34; /&gt;&lt;/h1&gt;
&lt;div id=&#34;section&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1.&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_station_info/&#34;&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/bike_station_data.png&#34; alt=&#34;The bike station metadata, pinned on RStudio Connect, is updated every week by a scheduled RMarkdown document&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_station_data_ingest/&#34;&gt;RMarkdown here&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2.&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_model_data/&#34;&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/bike_model_data.png&#34; alt=&#34;The analysis dataset is pinned to RStudio Connect by another RMarkdown job.&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_data_ingest/&#34;&gt;RMarkdown here&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;3.&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_available_model/&#34;&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/bike_model_train.png&#34; alt=&#34;An XGBoost model is trained and pinned to RStudio Connect on demand by a deployed RMarkdown script&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_model_build/&#34;&gt;RMarkdown here&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-3&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;4.&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_predict/&#34;&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/bike_api.png&#34; alt=&#34;A Plumber API is deployed on RStudio Connect, which calls the pinned model and serves model predictions.&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-4&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;5.&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://colorado.rstudio.com/rsc/bike_predict-app/&#34;&gt;&lt;img src=&#34;/post/2019-10-11-deploying-data-with-pins/index_files/bike_app.png&#34; alt=&#34;A Shiny app is deployed, which consumes the prediction API and visualizes the number of bikes available.&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The three times that a pin was useful here turn out to represent three of the most compelling reasons to use a pin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A small dataset that gets reused&lt;/strong&gt;. By accessing the station metadata dataset in a pin, I know I’m always getting the latest version regardless of which asset is using it, and it’s also accessible for other analyses in the future.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;An analysis dataset when you can’t write back to the database&lt;/strong&gt;. In this case, I don’t want to write an analysis dataset back to the raw data lake, so it’s easier to store it as a pin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A model in production&lt;/strong&gt;. By using a pin to store my model, it’s easy to update the version that’s in production by running the R Markdown document that trains the model. It’s also conceptually simple to update the model independently from the API that serves predictions or the Shiny app that visualizes the predictions.&lt;/p&gt;
&lt;p&gt;Pins can be a fantastic way to enable Shiny and Plumber in production. By giving data scientists a place to save and deploy the output of their projects, pins make it easier to create, deploy, and update models, datasets, and other production-ready R objects.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/10/17/deploying-data-with-pins/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>How to Send Custom E-mails with R</title>
      <link>https://rviews.rstudio.com/2019/09/04/how-to-send-custom-e-mails-with-rstudio-connect/</link>
      <pubDate>Wed, 04 Sep 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/09/04/how-to-send-custom-e-mails-with-rstudio-connect/</guid>
      <description>
        


&lt;p&gt;A common business-oriented data science task is to programmatically craft and send custom e-mails. In this post, I will show how to accomplish this with R on the &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt; platform (a paid product built for the enterprise) using the &lt;a href=&#34;https://cran.r-project.org/package=blastula&#34;&gt;&lt;code&gt;blastula&lt;/code&gt;&lt;/a&gt; package. &lt;code&gt;blastula&lt;/code&gt; provides a set of functions for composing high-quality HTML e-mails that render consistently across e-mail clients, such as Gmail and Outlook, and also includes tooling for sending out those e-mails via SMTP, the standard protocol for transmitting electronic mail between e-mail providers. At the bottom of the post you can find a link to documentation showing how to send e-mail with &lt;code&gt;blastula&lt;/code&gt; via an SMTP server without using RStudio Connect.&lt;/p&gt;
&lt;p&gt;As an example, we’ll pretend that I work in a marketing analytics department at an insurance company. I’m responsible for a marketing report, created with the &lt;code&gt;rmarkdown&lt;/code&gt; package, that tracks the number of bound policies from different marketing activities:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Mktg_Activity&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Policies&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Partnerships&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;345&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;E-mail Mktg&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;434&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;410&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Direct Mail&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;240&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;235&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Radio&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;128&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Having written the report with R Markdown, I will publish the script to RStudio Connect and have Connect create and send an e-mail for me. Once this is done, I’ll turn on both the &lt;a href=&#34;https://docs.rstudio.com/connect/user/settings-panel.html#content-schedule&#34; target=&#34;_blank&#34;&gt;&lt;em&gt;scheduler&lt;/em&gt;&lt;/a&gt; and &lt;em&gt;Send email after update&lt;/em&gt; options to have Connect re-run the report on a set schedule. By default, the e-mail generated by RStudio Connect looks something like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;screenshot.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;Because we haven’t done anything to customize the e-mail notification yet, Connect generated a standard out-of-the-box e-mail. It used the published document name for the e-mail subject and included a link to the report, as well as a time stamp of when it was executed. The e-mail also contains the actual report as an attachment, which can be downloaded and viewed. This is already useful, but I’d like to customize the e-mail to better fit my team’s needs. This is where the &lt;code&gt;blastula&lt;/code&gt; package comes in.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;blastula&lt;/code&gt; allows you to create and send e-mails using R. It works similarly to the Shiny package, but instead of writing R code to create an interactive application, you write R code to create an HTML e-mail that renders across a wide variety of e-mail providers. Once you’ve programmatically created an HTML e-mail, &lt;code&gt;blastula&lt;/code&gt; can also send that e-mail programmatically.&lt;/p&gt;
&lt;p&gt;To create your custom e-mail, simply add a new R code chunk at the bottom of your R Markdown script. You can use the &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/r-code.html&#34;&gt;code chunk option&lt;/a&gt; &lt;code&gt;include = FALSE&lt;/code&gt; so that your R code isn’t printed in your actual R Markdown report, since that wouldn’t be very helpful to whoever is reading the report:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# load the blastula package
library(&amp;quot;blastula&amp;quot;)

# create a simple e-mail
email &amp;lt;- compose_email(body = &amp;quot;Insert your e-mail body here&amp;quot;,
                       footer = &amp;quot;Insert your e-mail footer here&amp;quot;)

# preview e-mail in Viewer pane
preview_email(email)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;email1_screenshot.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;code&gt;blastula&lt;/code&gt; supports string interpolation, meaning it can display the value of an R variable rather than simply printing your R code as plain text. The way to tell &lt;code&gt;blastula&lt;/code&gt; what is R code and what is plain text is to add curly braces around anything you want interpreted as R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create an e-mail with R code

email_body &amp;lt;- 
&amp;quot;
Hi! This new report was generated at {Sys.time()}
&amp;quot;

email_footer &amp;lt;- 
&amp;quot;
Please contact *support@acme.com* with any questions
&amp;quot;

email &amp;lt;- compose_email(body = email_body,
                       footer = email_footer)

# preview e-mail in Viewer pane
preview_email(email)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;email2_screenshot.png&#34; /&gt;
 &lt;/p&gt;
&lt;p&gt;You’ll notice that not only can you include R code, but you can also supply Markdown syntax (notice how we italicized some of the footer text). In addition to all this, you can use helper functions included in the &lt;code&gt;blastula&lt;/code&gt; package to add other elements to your e-mail. For example, you can add a &lt;code&gt;ggplot2&lt;/code&gt; plot as an image using the &lt;code&gt;add_ggplot&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create an e-mail with a plot
library(ggplot2)

# `tb` holds the marketing-activity table shown earlier in the report
tb &amp;lt;- tibble::tribble(
  ~Mktg_Activity, ~Policies, ~Target,
  &amp;quot;Partnerships&amp;quot;,       345,     320,
  &amp;quot;E-mail Mktg&amp;quot;,        434,     410,
  &amp;quot;Direct Mail&amp;quot;,        240,     235,
  &amp;quot;Radio&amp;quot;,              128,     100
)

plot &amp;lt;- ggplot(tb, aes(Mktg_Activity, Policies)) + geom_bar(stat = &amp;quot;identity&amp;quot;)


email_body &amp;lt;- 
&amp;quot;
Hi! This new report was generated at {Sys.time()} \\


{add_ggplot(plot, width = 5, height = 3)}

&amp;quot;

email_footer &amp;lt;- 
&amp;quot;
Please contact *support@acme.com* with any questions
&amp;quot;

email &amp;lt;- compose_email(body = email_body,
                       footer = email_footer)

# preview e-mail in Viewer pane
preview_email(email)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;email3_screenshot.png&#34; /&gt;
 &lt;/p&gt;
&lt;p&gt;Now that we’ve created the e-mail programmatically, the next step is to send it out. Because my company uses RStudio Connect to host reports and send e-mail notifications, I need to add the following two lines of code to the bottom of my report so that RStudio Connect knows what to use as the e-mail body:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Use Blastula&amp;#39;s message as the email body in RStudio Connect.
rmarkdown::output_metadata$set(rsc_email_body_html = email$html_str)
rmarkdown::output_metadata$set(rsc_email_images = email$images)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before this point, the blastula code we’d written generated a nice HTML e-mail, which in this case was saved to a variable I called &lt;code&gt;email&lt;/code&gt;. However, Connect didn’t know that. To remedy this, we saved the e-mail we created to the &lt;code&gt;output_metadata&lt;/code&gt; object, which holds metadata about the rendered report, some of which Connect uses. Two of those items are &lt;code&gt;rsc_email_body_html&lt;/code&gt; and &lt;code&gt;rsc_email_images&lt;/code&gt;, which Connect uses to build the HTML notification e-mail it sends out. For consistency, you can always assign both of these items at the end of your R Markdown report, even if the e-mail does not initially contain embedded images.&lt;/p&gt;
&lt;p&gt;If you do not wish to use RStudio Connect to send messages, you can also use the &lt;code&gt;smtp_send()&lt;/code&gt; function to send your e-mail via an SMTP server. For instructions, check out the package’s “Sending Email Using SMTP” vignette on &lt;a href=&#34;https://github.com/rich-iannone/blastula/blob/master/vignettes/sending_using_smtp.Rmd&#34;&gt;GitHub&lt;/a&gt;. To learn more about crafting custom e-mails, check out the &lt;a href=&#34;https://cran.r-project.org/web/packages/blastula/blastula.pdf&#34; target=&#34;_blank&#34;&gt;blastula documentation&lt;/a&gt; and the &lt;a href=&#34;https://docs.rstudio.com/connect/user/r-markdown.html#r-markdown-email-customization&#34; target=&#34;_blank&#34;&gt;RStudio Connect User Guide&lt;/a&gt;.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/09/04/how-to-send-custom-e-mails-with-rstudio-connect/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Plumber Logging</title>
      <link>https://rviews.rstudio.com/2019/08/13/plumber-logging/</link>
      <pubDate>Tue, 13 Aug 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/08/13/plumber-logging/</guid>
      <description>
        


&lt;p&gt;The &lt;a href=&#34;https://www.rplumber.io/docs/&#34;&gt;plumber R package&lt;/a&gt; is used to expose R functions as API endpoints. Because plumber is so flexible, most major API design decisions are left to the developer. One important consideration when developing APIs is how to log information about API requests and responses. This information can be used to determine how plumber APIs are performing and how they are being used.&lt;/p&gt;
&lt;p&gt;An example of logging API requests in plumber is included in the &lt;a href=&#34;https://www.rplumber.io/docs/routing-and-input.html#filters&#34;&gt;package documentation&lt;/a&gt;. That example uses a filter to log information about incoming requests before a response has been generated. This is certainly a valid approach, but it means that the log cannot contain details about the response since the response hasn’t been created yet. In this post we will look at an alternative approach to logging plumber APIs that uses &lt;a href=&#34;https://www.rplumber.io/docs/programmatic-usage.html#router-hooks&#34;&gt;preroute and postroute hooks&lt;/a&gt; to log information about each API request and its associated response.&lt;/p&gt;
&lt;div id=&#34;logging&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Logging&lt;/h2&gt;
&lt;p&gt;In this example, we will use the &lt;a href=&#34;https://daroczig.github.io/logger/&#34;&gt;logger package&lt;/a&gt; to generate the actual log entries. Using this package isn’t required, but it does provide some convenient functionality that we will take advantage of.&lt;/p&gt;
&lt;p&gt;Since we will be registering hooks for our API, we will need both a &lt;code&gt;plumber.R&lt;/code&gt; file and an &lt;code&gt;entrypoint.R&lt;/code&gt; file. The &lt;code&gt;plumber.R&lt;/code&gt; file contains the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# plumber.R
# A simple API to illustrate logging with Plumber

library(plumber)

#* @apiTitle Logging Example

#* @apiDescription Simple example API for implementing logging with Plumber

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = &amp;quot;&amp;quot;) {
  list(msg = paste0(&amp;quot;The message is: &amp;#39;&amp;quot;, msg, &amp;quot;&amp;#39;&amp;quot;))
}

#* Plot a histogram
#* @png
#* @get /plot
function() {
  rand &amp;lt;- rnorm(100)
  hist(rand)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we’ve defined two endpoints (&lt;code&gt;/echo&lt;/code&gt; and &lt;code&gt;/plot&lt;/code&gt;), we can use &lt;code&gt;entrypoint.R&lt;/code&gt; to set up logging using preroute and postroute hooks. First, we need to configure the logger package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# entrypoint.R
library(plumber)

# logging
library(logger)

# Specify how logs are written
log_dir &amp;lt;- &amp;quot;logs&amp;quot;
if (!fs::dir_exists(log_dir)) fs::dir_create(log_dir)
log_appender(appender_tee(tempfile(&amp;quot;plumber_&amp;quot;, log_dir, &amp;quot;.log&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;log_appender()&lt;/code&gt; function is used to specify which appender method is used for logging. Here we use &lt;code&gt;appender_tee()&lt;/code&gt; so that logs will be written to &lt;code&gt;stdout&lt;/code&gt; and to a specific file path. We create a directory called &lt;code&gt;logs/&lt;/code&gt; in the current working directory to store the resulting logs. Every log file is assigned a unique name using &lt;code&gt;tempfile()&lt;/code&gt;. This prevents errors that can occur if concurrent processes try to write to the same file.&lt;/p&gt;
&lt;p&gt;Now, we need to create a helper function that we will use when creating log entries:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;convert_empty &amp;lt;- function(string) {
  if (string == &amp;quot;&amp;quot;) {
    &amp;quot;-&amp;quot;
  } else {
    string
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function converts empty strings into a dash (&lt;code&gt;&amp;quot;-&amp;quot;&lt;/code&gt;) and returns non-empty strings unchanged. We will use it to ensure that empty log values still get recorded, so the log files remain easy to read and parse. We’re now ready to create our plumber router and register the hooks necessary for logging:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pr &amp;lt;- plumb(&amp;quot;plumber.R&amp;quot;)

pr$registerHooks(
  list(
    preroute = function() {
      # Start timer for log info
      tictoc::tic()
    },
    postroute = function(req, res) {
      end &amp;lt;- tictoc::toc(quiet = TRUE)
      # Log details about the request and the response
      log_info(&amp;#39;{convert_empty(req$REMOTE_ADDR)} &amp;quot;{convert_empty(req$HTTP_USER_AGENT)}&amp;quot; {convert_empty(req$HTTP_HOST)} {convert_empty(req$REQUEST_METHOD)} {convert_empty(req$PATH_INFO)} {convert_empty(res$status)} {round(end$toc - end$tic, digits = getOption(&amp;quot;digits&amp;quot;, 5))}&amp;#39;)
    }
  )
)

pr&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We use the &lt;code&gt;$registerHooks()&lt;/code&gt; method to register both preroute and postroute hooks. The preroute hook uses the &lt;a href=&#34;http://collectivemedia.github.io/tictoc/&#34;&gt;tictoc package&lt;/a&gt; to start a timer. The postroute hook stops the timer and then writes a log entry using the &lt;code&gt;log_info()&lt;/code&gt; function from the logger package. Each log entry contains the following information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Log level: This is a distinction made by the logger package, and in this
example the value is always INFO&lt;/li&gt;
&lt;li&gt;Timestamp: The timestamp for when the response was generated and sent back to
the client&lt;/li&gt;
&lt;li&gt;Remote Address: The address of the client making the request&lt;/li&gt;
&lt;li&gt;User Agent: The user agent making the request&lt;/li&gt;
&lt;li&gt;Http Host: The host of the API&lt;/li&gt;
&lt;li&gt;Method: The HTTP method attached to the request&lt;/li&gt;
&lt;li&gt;Path: The specific API endpoint requested&lt;/li&gt;
&lt;li&gt;Status: The HTTP status of the response&lt;/li&gt;
&lt;li&gt;Execution Time: The amount of time from when the request was received until the response was generated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This log format is loosely inspired by the &lt;a href=&#34;https://en.wikipedia.org/wiki/Common_Log_Format&#34;&gt;NCSA Common log format&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;testing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Testing&lt;/h2&gt;
&lt;p&gt;Now that our API is all set up, it’s time to test that logging works as expected. First, we need to start the API. The easiest way to do this is to click the Run API button that appears at the top of the &lt;code&gt;plumber.R&lt;/code&gt; file in the RStudio IDE. Once the API is running, you’ll see a message in the console similar to the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Running plumber API at http://127.0.0.1:5762&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we know the API is running, we need to make a request. One of the easiest ways to make a request in this case is to open a web browser (like Google Chrome) and type the API address in the address bar followed by &lt;code&gt;/plot&lt;/code&gt;. In this example, I would type &lt;code&gt;http://127.0.0.1:5762/plot&lt;/code&gt; into the address bar of my browser. If all goes well, you should see a plot rendered in the browser. The RStudio console will display the log output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INFO [2019-08-09 12:30:23] 127.0.0.1 &amp;quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36&amp;quot; localhost:5762 GET /plot 200 0.158&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A new &lt;code&gt;logs/&lt;/code&gt; directory will have been created in the current working directory and it will contain a file with the log entry. You can generate more log entries by refreshing your browser window.&lt;/p&gt;
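&lt;p&gt;You can also generate requests from R itself, which is handy for scripting many test requests at once. A minimal sketch using the &lt;code&gt;httr&lt;/code&gt; package (assuming the same address and port printed to your console; yours will likely differ):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(httr)

# The port below is just the one from the example above; use your own
resp &amp;lt;- GET(&amp;quot;http://127.0.0.1:5762/plot&amp;quot;)
status_code(resp)  # 200 if the request succeeded&lt;/code&gt;&lt;/pre&gt;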
&lt;/div&gt;
&lt;div id=&#34;analyzing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analyzing&lt;/h2&gt;
&lt;p&gt;Let’s say that we refreshed the browser window 1,000 times. The log file generated will contain an entry for each request. We can analyze this log file to find helpful information about the API. For example, we could plot a histogram of execution time:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

plumber_log &amp;lt;- readr::read_log(&amp;quot;logs/plumber_fe3daed895d.log&amp;quot;,
                               col_names = c(&amp;quot;log_level&amp;quot;,
                                             &amp;quot;timestamp&amp;quot;,
                                             &amp;quot;remote_address&amp;quot;,
                                             &amp;quot;user_agent&amp;quot;,
                                             &amp;quot;http_host&amp;quot;,
                                             &amp;quot;method&amp;quot;,
                                             &amp;quot;path&amp;quot;,
                                             &amp;quot;status&amp;quot;,
                                             &amp;quot;execution_time&amp;quot;))

ggplot(plumber_log, aes(x = execution_time)) +
  geom_histogram() +
  theme_bw() +
  labs(title = &amp;quot;Execution Times&amp;quot;,
       x = &amp;quot;Execution Time&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2019-08-10-plumber-logging/index_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We could even build a &lt;a href=&#34;http://shiny.rstudio.com&#34;&gt;Shiny application&lt;/a&gt; to monitor the &lt;code&gt;logs/&lt;/code&gt; directory and provide real-time visibility into API metrics!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;log-monitoring.gif&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The details of this Shiny application go beyond the scope of this post, but the source code is available &lt;a href=&#34;https://github.com/sol-eng/plumber-logging/blob/master/R/shiny/app.R&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Plumber APIs published to &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt; can use this pattern to log and monitor API requests. Details on this use case can be found in &lt;a href=&#34;https://github.com/sol-eng/plumber-logging#deployment&#34;&gt;this repository&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Plumber is an incredibly flexible package for exposing R functions as API endpoints. Logging information about API requests and responses provides visibility into API usage and performance. These log files can be manually inspected or used in connection with other tools (like Shiny) to provide real-time metrics around API use. The code used in this example along with additional information is available in &lt;a href=&#34;https://github.com/sol-eng/plumber-logging&#34;&gt;this GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you are interested in learning more about using plumber, logging, Shiny, and RStudio Connect, please visit &lt;a href=&#34;https://community.rstudio.com/&#34;&gt;community.rstudio.com&lt;/a&gt; and let us know!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;James Blair is a solutions engineer at RStudio who focuses on tools,
technologies, and best practices for using R in the enterprise.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/08/13/plumber-logging/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Three Strategies for Working with Big Data in R</title>
      <link>https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/</link>
      <pubDate>Wed, 17 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/</guid>
      <description>
        


&lt;p&gt;For many R users, it’s obvious &lt;em&gt;why&lt;/em&gt; you’d want to use R with big data, but not so obvious how. In fact, many people (wrongly) believe that R just doesn’t work very well for big data.&lt;/p&gt;
&lt;p&gt;In this article, I’ll share three strategies for thinking about how to use big data in R, as well as some examples of how to execute each of them.&lt;/p&gt;
&lt;p&gt;By default, R runs only on data that can fit into your computer’s memory. Hardware advances have made this less of a problem for many users, since these days most laptops come with at least 4-8 GB of memory, and you can get instances on any major cloud provider with terabytes of RAM. But this is still a real problem for almost any data set that could really be called &lt;em&gt;big data&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The fact that R runs on in-memory data is the biggest issue that you face when trying to use Big Data in R. The data has to fit into the RAM on your machine, and it’s not even 1:1. Because you’re actually &lt;em&gt;doing&lt;/em&gt; something with the data, a good rule of thumb is that your machine needs 2-3x the RAM of the size of your data.&lt;/p&gt;
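&lt;p&gt;A quick way to sanity-check this rule of thumb is to measure how much RAM a data set actually occupies once loaded. A minimal sketch using base R (the million-row data frame here is purely illustrative):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# How much memory does a 1-million-row data frame occupy?
df &amp;lt;- data.frame(x = rnorm(1e6), y = rnorm(1e6))
print(object.size(df), units = &amp;quot;MB&amp;quot;)

# Compare against your available RAM, remembering the 2-3x rule of thumb&lt;/code&gt;&lt;/pre&gt;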
&lt;p&gt;Another big issue for doing Big Data work in R is that data transfer speeds are extremely slow relative to the time it takes to actually process the data once it has transferred. For example, a round trip over the internet from San Francisco to New York City takes over 4 times longer than reading from a standard hard drive and over 200 times longer than reading from a solid-state drive.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; This is an especially big problem early in a modeling or analytical project, when data might have to be pulled repeatedly.&lt;/p&gt;
&lt;p&gt;Nevertheless, there are effective methods for working with big data in R. In this post, I’ll share three strategies. It’s important to note that these strategies aren’t mutually exclusive – they can be combined as you see fit!&lt;/p&gt;
&lt;div id=&#34;strategy-1-sample-and-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 1: Sample and Model&lt;/h2&gt;
&lt;p&gt;To sample and model, you downsample your data to a size that can be easily downloaded in its entirety and create a model on the sample. Downsampling to thousands – or even hundreds of thousands – of data points can make model runtimes feasible while also maintaining statistical validity.&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If maintaining class balance is necessary (or one class needs to be over/under-sampled), it’s reasonably simple to stratify the data set during sampling.&lt;/p&gt;
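&lt;p&gt;For an in-memory data frame, a stratified downsample is a one-liner with &lt;code&gt;dplyr&lt;/code&gt;. A hedged sketch (the data frame &lt;code&gt;df&lt;/code&gt; and its &lt;code&gt;class&lt;/code&gt; column are hypothetical):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

# Draw 1,000 rows from each class so the sample stays balanced
df_sample &amp;lt;- df %&amp;gt;%
  group_by(class) %&amp;gt;%
  slice_sample(n = 1000) %&amp;gt;%
  ungroup()&lt;/code&gt;&lt;/pre&gt;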
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/sample_model.png&#34; alt=&#34;Illustration of Sample and Model&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Illustration of Sample and Model&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt; Relative to working on your entire data set, working on just a sample can drastically decrease run times and increase iteration speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prototyping&lt;/strong&gt; Even if you’ll eventually have to run your model on the entire data set, this can be a good way to refine hyperparameters and do feature engineering for your model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Packages&lt;/strong&gt; Since you’re working on a normal in-memory data set, you can use all your favorite R packages.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sampling&lt;/strong&gt; Downsampling isn’t terribly difficult, but does need to be done with care to ensure that the sample is valid and that you’ve pulled enough points from the original data set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt; If you’re using sample and model to prototype something that will later be run on the full data set, you’ll need to have a strategy (such as &lt;a href=&#34;#push-compute&#34;&gt;pushing compute to the data&lt;/a&gt;) for scaling your prototype version back to the full data set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Totals&lt;/strong&gt; Business Intelligence (BI) tasks frequently answer questions about totals, like the count of all sales in a month. One of the other strategies is usually a better fit in this case.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;strategy-2-chunk-and-pull&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 2: Chunk and Pull&lt;/h2&gt;
&lt;p&gt;In this strategy, the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining. This strategy is conceptually similar to the &lt;a href=&#34;https://en.wikipedia.org/wiki/MapReduce&#34;&gt;MapReduce&lt;/a&gt; algorithm. Depending on the task at hand, the chunks might be time periods, geographic units, or logical units like separate businesses, departments, products, or customer segments.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/chunk_pull.png&#34; alt=&#34;Chunk and Pull Illustration&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Chunk and Pull Illustration&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full data set&lt;/strong&gt; The entire data set gets used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallelization&lt;/strong&gt; If the chunks are run separately, the problem is easy to treat as &lt;a href=&#34;https://en.wikipedia.org/wiki/Embarrassingly_parallel&#34;&gt;embarrassingly parallel&lt;/a&gt;, making it possible to use parallelization to speed up runtimes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Need Chunks&lt;/strong&gt; Your data needs to have separable chunks for chunk and pull to be appropriate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pull All Data&lt;/strong&gt; You eventually have to pull in all of the data, which may still be very time- and memory-intensive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stale Data&lt;/strong&gt; The data may require periodic refreshes from the database to stay up-to-date since you’re saving a version on your local machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;push-compute&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 3: Push Compute to Data&lt;/h2&gt;
&lt;p&gt;In this strategy, the data is compressed in the database, and only the much smaller compressed result set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.&lt;/p&gt;
&lt;p&gt;Sometimes, more complex operations are also possible, including computing histogram and raster maps with &lt;a href=&#34;https://db.rstudio.com/dbplot/&#34;&gt;&lt;code&gt;dbplot&lt;/code&gt;&lt;/a&gt;, building a model with &lt;a href=&#34;https://cran.r-project.org/web/packages/modeldb/index.html&#34;&gt;&lt;code&gt;modeldb&lt;/code&gt;&lt;/a&gt;, and generating predictions from machine learning models with &lt;a href=&#34;https://db.rstudio.com/tidypredict/&#34;&gt;&lt;code&gt;tidypredict&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
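&lt;p&gt;As a small, hedged illustration of the idea (assuming a remote table like the &lt;code&gt;flights&lt;/code&gt; table used later in this post), &lt;code&gt;dbplot&lt;/code&gt; can compute histogram bins inside the database and return only the bin counts to R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dbplot)

# The binning happens in the database; only the counts travel to R
df %&amp;gt;%
  dbplot_histogram(arr_delay, binwidth = 15)&lt;/code&gt;&lt;/pre&gt;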
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/push_data.png&#34; alt=&#34;Push Compute to Data Illustration&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Push Compute to Data Illustration&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use the Database&lt;/strong&gt; Takes advantage of what databases are often best at: quickly summarizing and filtering data based on a query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More Info, Less Transfer&lt;/strong&gt; By compressing before pulling data back to R, the entire data set gets used, but transfer times are far less than moving the entire data set.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database Operations&lt;/strong&gt; Depending on what database you’re using, some operations might not be supported.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Speed&lt;/strong&gt; In some contexts, the limiting factor for data analysis is the speed of the database itself, and so pushing more work onto the database is the last thing analysts want to do.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;an-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;An Example&lt;/h2&gt;
&lt;p&gt;I’ve preloaded the &lt;code&gt;flights&lt;/code&gt; data set from the &lt;a href=&#34;https://cran.r-project.org/web/packages/nycflights13/index.html&#34;&gt;&lt;code&gt;nycflights13&lt;/code&gt;&lt;/a&gt; package into a PostgreSQL database, which I’ll use for these examples.&lt;/p&gt;
&lt;p&gt;Let’s start by connecting to the database. I’m using a config file here to connect to the database, one of RStudio’s &lt;a href=&#34;https://db.rstudio.com/best-practices/managing-credentials/&#34;&gt;recommended database connection methods&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DBI)
library(dplyr)
library(ggplot2)

# Read connection details (driver, server, credentials) from a config.yml
config &amp;lt;- config::get()

db &amp;lt;- DBI::dbConnect(
  odbc::odbc(),
  Driver = config$driver,
  Server = config$server,
  Port = config$port,
  Database = config$database,
  UID = config$uid,
  PWD = config$pwd,
  BoolsAsChar = &amp;quot;&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href=&#34;https://dplyr.tidyverse.org/&#34;&gt;&lt;code&gt;dplyr&lt;/code&gt;&lt;/a&gt; package is a great tool for interacting with databases, since I can write normal R code that is translated into SQL on the backend. I could also use the &lt;a href=&#34;https://db.rstudio.com/dbi/&#34;&gt;&lt;code&gt;DBI&lt;/code&gt;&lt;/a&gt; package to send queries directly, or a &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/language-engines.html#sql&#34;&gt;SQL chunk&lt;/a&gt; in the R Markdown document.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- dplyr::tbl(db, &amp;quot;flights&amp;quot;)
tally(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##        n
##    &amp;lt;int&amp;gt;
## 1 336776&lt;/code&gt;&lt;/pre&gt;
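&lt;p&gt;Because &lt;code&gt;df&lt;/code&gt; is a lazy remote table, you can inspect the SQL that &lt;code&gt;dplyr&lt;/code&gt; generates at any point with &lt;code&gt;show_query()&lt;/code&gt; – a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df %&amp;gt;%
  count(carrier) %&amp;gt;%
  show_query()&lt;/code&gt;&lt;/pre&gt;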
&lt;p&gt;With only a few hundred thousand rows, this example isn’t close to the kind of big data that really requires a Big Data strategy, but it’s rich enough to demonstrate on.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;sample-and-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sample and Model&lt;/h2&gt;
&lt;p&gt;Let’s say I want to model whether flights will be delayed or not. This is a great problem to sample and model.&lt;/p&gt;
&lt;p&gt;Let’s start with some minor cleaning of the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create is_delayed column in database
df &amp;lt;- df %&amp;gt;%
  mutate(
    # Create is_delayed column
    is_delayed = arr_delay &amp;gt; 0,
    # Get just hour (currently formatted so 6 pm = 1800)
    hour = sched_dep_time / 100
  ) %&amp;gt;%
  # Remove small carriers that make modeling difficult
  filter(!is.na(is_delayed) &amp;amp; !carrier %in% c(&amp;quot;OO&amp;quot;, &amp;quot;HA&amp;quot;))


df %&amp;gt;% count(is_delayed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed      n
##   &amp;lt;lgl&amp;gt;       &amp;lt;int&amp;gt;
## 1 FALSE      194078
## 2 TRUE       132897&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These classes are reasonably well balanced, but since I’m going to be using logistic regression, I’m going to load a perfectly balanced sample of 40,000 data points.&lt;/p&gt;
&lt;p&gt;For most databases, random sampling methods don’t work super smoothly with R, so I can’t use &lt;code&gt;dplyr::sample_n&lt;/code&gt; or &lt;code&gt;dplyr::sample_frac&lt;/code&gt;. I’ll have to be a little more manual.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1028)

# Create a modeling dataset 
df_mod &amp;lt;- df %&amp;gt;%
  # Within each class
  group_by(is_delayed) %&amp;gt;%
  # Assign random rank (using random and row_number from postgres)
  mutate(x = random() %&amp;gt;% row_number()) %&amp;gt;%
  ungroup()

# Take first 20K for each class for training set
df_train &amp;lt;- df_mod %&amp;gt;%
  filter(x &amp;lt;= 20000) %&amp;gt;%
  collect()

# Take next 5K for test set
df_test &amp;lt;- df_mod %&amp;gt;%
  filter(x &amp;gt; 20000 &amp;amp; x &amp;lt;= 25000) %&amp;gt;%
  collect()

# Double check I sampled right
count(df_train, is_delayed)
count(df_test, is_delayed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed     n
##   &amp;lt;lgl&amp;gt;      &amp;lt;int&amp;gt;
## 1 FALSE      20000
## 2 TRUE       20000&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed     n
##   &amp;lt;lgl&amp;gt;      &amp;lt;int&amp;gt;
## 1 FALSE       5000
## 2 TRUE        5000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s build a model – let’s see if we can predict whether there will be a delay or not by the combination of the carrier, the month of the flight, and the time of day of the flight.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod &amp;lt;- glm(is_delayed ~ carrier + 
             as.character(month) + 
             poly(sched_dep_time, 3),
           family = &amp;quot;binomial&amp;quot;, 
           data = df_train)

# Out-of-Sample AUROC
df_test$pred &amp;lt;- predict(mod, newdata = df_test)
auc &amp;lt;- suppressMessages(pROC::auc(df_test$is_delayed, df_test$pred))
auc&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Area under the curve: 0.6425&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, this is not a great model and any modelers reading this will have many ideas of how to improve what I’ve done. But that wasn’t the point!&lt;/p&gt;
&lt;p&gt;I built a model on a small subset of a big data set. Including sampling time, this took my laptop less than 10 seconds to run, making it easy to iterate quickly as I want to improve the model. After I’m happy with this model, I could pull down a larger sample or even the entire data set if it’s feasible, or do something with the model from the sample.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;chunk-and-pull&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Chunk and Pull&lt;/h2&gt;
&lt;p&gt;In this case, I want to build another model of on-time arrival, but I want to do it per-carrier. This is exactly the kind of use case that’s ideal for chunk and pull. I’m going to separately pull the data in by carrier and run the model on each carrier’s data.&lt;/p&gt;
&lt;p&gt;I’m going to start by just getting the complete list of the carriers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all unique carriers
carriers &amp;lt;- df %&amp;gt;% 
  select(carrier) %&amp;gt;% 
  distinct() %&amp;gt;% 
  pull(carrier)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, I’ll write a function that&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;takes the name of a carrier as input&lt;/li&gt;
&lt;li&gt;pulls the data for that carrier into R&lt;/li&gt;
&lt;li&gt;splits the data into training and test&lt;/li&gt;
&lt;li&gt;trains the model&lt;/li&gt;
&lt;li&gt;outputs the out-of-sample AUROC (a common measure of model quality)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;carrier_model &amp;lt;- function(carrier_name) {
  # Pull a chunk of data
  df_mod &amp;lt;- df %&amp;gt;%
    dplyr::filter(carrier == carrier_name) %&amp;gt;%
    collect()
  
  # Split into training and test
  split &amp;lt;- df_mod %&amp;gt;%
    rsample::initial_split(prop = 0.9, strata = &amp;quot;is_delayed&amp;quot;) %&amp;gt;% 
    suppressMessages()
  
  # Get training data
  df_train &amp;lt;- split %&amp;gt;% rsample::training()
  
  # Train model
  mod &amp;lt;- glm(is_delayed ~ as.character(month) + poly(sched_dep_time, 3),
             family = &amp;quot;binomial&amp;quot;,
             data = df_train)
  
  # Get out-of-sample AUROC
  df_test &amp;lt;- split %&amp;gt;% rsample::testing()
  df_test$pred &amp;lt;- predict(mod, newdata = df_test)
  suppressMessages(auc &amp;lt;- pROC::auc(df_test$is_delayed ~ df_test$pred))
  
  auc
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, I’m going to actually run the carrier model function across each of the carriers. This code runs pretty quickly, and so I don’t think the overhead of parallelization would be worth it. But if I wanted to, I would replace the &lt;code&gt;lapply&lt;/code&gt; call below with a parallel backend.&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(98765)
mods &amp;lt;- lapply(carriers, carrier_model) %&amp;gt;%
  suppressMessages()

names(mods) &amp;lt;- carriers&lt;/code&gt;&lt;/pre&gt;
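&lt;p&gt;If the per-carrier models were heavier, a hedged sketch of a parallel drop-in replacement might use the &lt;code&gt;future.apply&lt;/code&gt; package, with &lt;code&gt;future.seed&lt;/code&gt; handling the random-number-generation concern mentioned in the footnote:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(future.apply)
plan(multisession)

# future.seed = TRUE gives each worker a statistically sound,
# reproducible RNG stream
mods &amp;lt;- future_lapply(carriers, carrier_model, future.seed = TRUE) %&amp;gt;%
  suppressMessages()

names(mods) &amp;lt;- carriers&lt;/code&gt;&lt;/pre&gt;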
&lt;p&gt;Let’s look at the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $UA
## Area under the curve: 0.6408
## 
## $AA
## Area under the curve: 0.6041
## 
## $B6
## Area under the curve: 0.6475
## 
## $DL
## Area under the curve: 0.6162
## 
## $EV
## Area under the curve: 0.6419
## 
## $MQ
## Area under the curve: 0.5973
## 
## $US
## Area under the curve: 0.6096
## 
## $WN
## Area under the curve: 0.6968
## 
## $VX
## Area under the curve: 0.6969
## 
## $FL
## Area under the curve: 0.6347
## 
## $AS
## Area under the curve: 0.6906
## 
## $`9E`
## Area under the curve: 0.6071
## 
## $F9
## Area under the curve: 0.625
## 
## $YV
## Area under the curve: 0.7029&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So these models (again) are only a little better than random chance. The point, though, was to use the chunk and pull strategy: pulling the data separately by logical units and building a model on each chunk.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;push-compute-to-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Push Compute to the Data&lt;/h2&gt;
&lt;p&gt;In this case, I’m doing a pretty simple BI task - plotting the proportion of flights that are late by the hour of departure and the airline.&lt;/p&gt;
&lt;p&gt;Just by way of comparison, let’s run this first the naive way – pulling all the data to my system and then doing my data manipulation to plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system.time(
  df_plot &amp;lt;- df %&amp;gt;%
    collect() %&amp;gt;%
    # Change is_delayed to numeric
    mutate(is_delayed = ifelse(is_delayed, 1, 0)) %&amp;gt;%
    group_by(carrier, sched_dep_time) %&amp;gt;%
    # Get proportion per carrier-time
    summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %&amp;gt;%
    ungroup() %&amp;gt;%
    # Change string times into actual times
    mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, &amp;quot;left&amp;quot;, &amp;quot;0&amp;quot;) %&amp;gt;% 
             strptime(&amp;quot;%H%M&amp;quot;) %&amp;gt;% 
             as.POSIXct())) -&amp;gt; timing1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that wasn’t too bad, just &lt;code&gt;2.366&lt;/code&gt; seconds on my laptop.&lt;/p&gt;
&lt;p&gt;But let’s see how much of a speedup we can get from pushing the compute to the data. The conceptual change here is significant - I’m doing as much work as possible on the Postgres server now instead of locally. But using &lt;code&gt;dplyr&lt;/code&gt; means that the code change is minimal. The only difference in the code is that the &lt;code&gt;collect&lt;/code&gt; call got moved down by a few lines (to below &lt;code&gt;ungroup()&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system.time(
  df_plot &amp;lt;- df %&amp;gt;%
    # Change is_delayed to numeric
    mutate(is_delayed = ifelse(is_delayed, 1, 0)) %&amp;gt;%
    group_by(carrier, sched_dep_time) %&amp;gt;%
    # Get proportion per carrier-time
    summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %&amp;gt;%
    ungroup() %&amp;gt;%
    collect() %&amp;gt;%
    # Change string times into actual times
    mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, &amp;quot;left&amp;quot;, &amp;quot;0&amp;quot;) %&amp;gt;% 
             strptime(&amp;quot;%H%M&amp;quot;) %&amp;gt;% 
             as.POSIXct())) -&amp;gt; timing2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might have taken you the same time to read this code as the last chunk, but this took only &lt;code&gt;0.269&lt;/code&gt; seconds to run, almost an order of magnitude faster!&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; That’s pretty good for just moving one line of code.&lt;/p&gt;
&lt;p&gt;Now that we’ve done a speed comparison, we can create the nice plot we all came for.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_plot %&amp;gt;%
  mutate(carrier = paste0(&amp;quot;Carrier: &amp;quot;, carrier)) %&amp;gt;%
  ggplot(aes(x = sched_dep_time, y = delay_pct)) +
  geom_line() +
  facet_wrap(&amp;quot;carrier&amp;quot;) +
  ylab(&amp;quot;Proportion of Flights Delayed&amp;quot;) +
  xlab(&amp;quot;Time of Day&amp;quot;) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_datetime(date_breaks = &amp;quot;4 hours&amp;quot;, 
                   date_labels = &amp;quot;%H&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It looks to me like flights later in the day might be a little more likely to experience delays, but that’s a question for another blog post.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://blog.codinghorror.com/the-infinite-space-between-words/&#34; class=&#34;uri&#34;&gt;https://blog.codinghorror.com/the-infinite-space-between-words/&lt;/a&gt;&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;This isn’t just a general heuristic. You’ll probably remember that the error in many statistical processes shrinks by a factor of &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{\sqrt{n}}\)&lt;/span&gt; for sample size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, so a lot of the statistical power in your model is driven by adding the first few thousand observations compared to the final millions.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;One of the biggest problems when parallelizing is dealing with random number generation, which you use here to make sure that your test/training splits are reproducible. It’s not an insurmountable problem, but requires some careful thought.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;And lest you think the real difference here is offloading computation to a more powerful database, this Postgres instance is running on a container on my laptop, so it’s got exactly the same horsepower behind it.&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Reproducible Environments</title>
      <link>https://rviews.rstudio.com/2019/04/22/reproducible-environments/</link>
      <pubDate>Mon, 22 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/04/22/reproducible-environments/</guid>
      <description>
        
&lt;script src=&#34;/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;/rmarkdown-libs/plotly-binding/plotly.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;/rmarkdown-libs/typedarray/typedarray.min.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;/rmarkdown-libs/jquery/jquery.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;/rmarkdown-libs/crosstalk/css/crosstalk.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;/rmarkdown-libs/crosstalk/js/crosstalk.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;/rmarkdown-libs/plotly-htmlwidgets-css/plotly-htmlwidgets.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;/rmarkdown-libs/plotly-main/plotly-latest.min.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Great data science work should be reproducible. The ability to repeat
experiments is part of the foundation for all science, and reproducible work is
also critical for business applications. Team collaboration, project validation,
and sustainable products presuppose the ability to reproduce work over time.&lt;/p&gt;
&lt;p&gt;In my opinion, mastering just a handful of important tools will make
reproducible work in R much easier for data scientists. R users should be
familiar with version control, RStudio projects, and literate programming
through R Markdown. Once these tools are mastered, the major remaining challenge
is creating a reproducible environment.&lt;/p&gt;
&lt;p&gt;An environment consists of all the dependencies required to enable your code to
run correctly. This includes R itself, R packages, and system dependencies. As
with many programming languages, it can be challenging to manage reproducible R
environments. Common issues include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code that used to run no longer runs, even though the code has not changed.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Being afraid to upgrade or install a new package, because it might break your code or someone else’s.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Typing &lt;code&gt;install.packages&lt;/code&gt; in your environment doesn’t do anything, or doesn’t do the &lt;em&gt;right&lt;/em&gt; thing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These challenges can be addressed through a careful combination of tools and
strategies. This post describes two use cases for reproducible environments:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Safely upgrading packages&lt;/li&gt;
&lt;li&gt;Collaborating on a team&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The sections below each cover a strategy to address the use case, and the necessary
tools to implement each strategy. Additional use cases, strategies, and tools are
presented at &lt;a href=&#34;https://environments.rstudio.com&#34; class=&#34;uri&#34;&gt;https://environments.rstudio.com&lt;/a&gt;. This website is a work in
progress, but we look forward to your feedback.&lt;/p&gt;
&lt;div id=&#34;safely-upgrading-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Safely Upgrading Packages&lt;/h2&gt;
&lt;p&gt;Upgrading packages can be a risky affair. It is not difficult to find serious R
users who have felt the unintended consequences of an upgrade: the new version
broke parts of their current code, or upgrading a
package for one project accidentally broke the code in another project. A
strategy for safely upgrading packages consists of three steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Isolate a project&lt;/li&gt;
&lt;li&gt;Record the current dependencies&lt;/li&gt;
&lt;li&gt;Upgrade packages&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first step in this strategy ensures one project’s packages and upgrades
won’t interfere with any other projects. Isolating projects is accomplished by
creating per-project libraries. A tool that makes this easy is the new &lt;a href=&#34;https://github.com/rstudio/renv&#34;&gt;&lt;code&gt;renv&lt;/code&gt;
package&lt;/a&gt;. Inside of your R project, simply use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# inside the project directory
renv::init()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second step is to record the current dependencies. This step is critical
because it creates a safety net. If the package upgrade goes poorly, you’ll be
able to revert the changes and return to the record of the working state. Again,
the &lt;code&gt;renv&lt;/code&gt; package makes this process easy.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = &amp;quot;abc123&amp;quot;)

# and restore the previous environment
renv::restore()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With an isolated project and a safety net in place, you can now proceed to
upgrade or add new packages, while remaining certain the current functional
environment is still reproducible. The &lt;a href=&#34;https://github.com/r-lib/pak&#34;&gt;&lt;code&gt;pak&lt;/code&gt;
package&lt;/a&gt; can be used to install and upgrade
packages in an interactive environment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# upgrade packages quickly and safely
pak::pkg_install(&amp;quot;ggplot2&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The safety net provided by the &lt;code&gt;renv&lt;/code&gt; package relies on access to older versions
of R packages. For public packages, CRAN provides these older versions in the
&lt;a href=&#34;https://cran.rstudio.com/src/contrib/Archive&#34;&gt;CRAN archive&lt;/a&gt;. Organizations can
use tools like &lt;a href=&#34;https://rstudio.com/products/package-manager&#34;&gt;RStudio Package
Manager&lt;/a&gt; to make multiple versions
of private packages available. The &lt;a href=&#34;https://environments.rstudio.com/snapshot&#34;&gt;“snapshot and
restore”&lt;/a&gt; approach can also be used
to &lt;a href=&#34;https://environments.rstudio.com/deploy&#34;&gt;promote content to production&lt;/a&gt;. In
fact, this approach is exactly how &lt;a href=&#34;https://rstudio.com/products/connect&#34;&gt;RStudio
Connect&lt;/a&gt; and
&lt;a href=&#34;https://shinyapps.io&#34;&gt;shinyapps.io&lt;/a&gt; deploy thousands of R applications to
production each day!&lt;/p&gt;
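&lt;p&gt;As a concrete illustration of the safety net the archive provides, a specific older release of a package can be installed directly with the &lt;code&gt;remotes&lt;/code&gt; package (the version number below is only an example):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install an older release from the CRAN archive
remotes::install_version(&amp;quot;ggplot2&amp;quot;, version = &amp;quot;3.1.1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;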
&lt;/div&gt;
&lt;div id=&#34;team-collaboration&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Team Collaboration&lt;/h2&gt;
&lt;p&gt;A common challenge for teams is sharing and running code. One strategy that
administrators and R users can adopt to facilitate collaboration is the use of
shared baselines. The basics of the strategy are simple:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Administrators set up a common environment for R users by installing RStudio Server.&lt;/li&gt;
&lt;li&gt;On the server, administrators &lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/215488098&#34;&gt;install multiple versions of R&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Each version of R is tied to a frozen repository using an &lt;code&gt;Rprofile.site&lt;/code&gt; file.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By using a frozen repository, either administrators or users can install
packages while still being sure that everyone will get the same set of packages.
A frozen repository also ensures that adding new packages won’t upgrade other
shared packages as a side-effect. New packages and upgrades are offered to users
over time through the addition of new versions of R.&lt;/p&gt;
&lt;p&gt;Frozen repositories can be created by manually cloning CRAN, accessing a service
like MRAN, or utilizing a supported product like &lt;a href=&#34;https://rstudio.com/products/package-manager&#34;&gt;RStudio Package
Manager&lt;/a&gt;.&lt;/p&gt;
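&lt;p&gt;As a minimal sketch, tying a version of R to a frozen repository only requires setting the &lt;code&gt;repos&lt;/code&gt; option in that installation’s &lt;code&gt;Rprofile.site&lt;/code&gt; file; the repository URL below is a placeholder for your frozen CRAN mirror or Package Manager snapshot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# in R_HOME/etc/Rprofile.site
# point this version of R at a repository frozen on a specific date
options(repos = c(CRAN = &amp;quot;https://my-package-server/cran/2019-04-15&amp;quot;))&lt;/code&gt;&lt;/pre&gt;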
&lt;p&gt;&lt;img src=&#34;/post/2019-04-15-repro-envs_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adaptable-strategies&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adaptable Strategies&lt;/h2&gt;
&lt;p&gt;The prior sections presented specific strategies for creating reproducible
environments in two common cases. The same strategy may not be appropriate for
every organization, R user, or situation. If you’re a student reporting an
error to your professor, capturing your &lt;code&gt;sessionInfo()&lt;/code&gt; may be all you need. In
contrast, a statistician working on a clinical trial will need a robust
framework for recreating their environment. &lt;strong&gt;Reproducibility is not binary!&lt;/strong&gt;&lt;/p&gt;
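&lt;p&gt;At the lightweight end of that spectrum, capturing the session state is a one-liner in base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# record the R version, platform, and loaded packages alongside a bug report
writeLines(capture.output(sessionInfo()), &amp;quot;session-info.txt&amp;quot;)&lt;/code&gt;&lt;/pre&gt;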
&lt;p&gt;&lt;img src=&#34;/post/2019-04-15-repro-envs_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To help pick between strategies, we’ve developed a &lt;a href=&#34;https://environments.rstudio.com/reproduce&#34;&gt;strategy
map&lt;/a&gt;. By answering two questions,
you can quickly identify where your team falls on this map and identify the
nearest successful strategy. The two questions are represented on the x and
y-axis of the map:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Do I have any restrictions on what packages can be used?&lt;/li&gt;
&lt;li&gt;Who is responsible for managing installed packages?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Figure: Reproducing Environments: Strategies and Danger Zones. The map plots who is responsible for reproducing the environment (admins to users) against package access (locked down to open). The successful strategies (Validated, Shared Baseline, and Snapshot) lie along the diagonal, while Wild West, Ticket System, and Blocked sit in the danger zones on either side of it.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For more information on picking and using these strategies, please visit
&lt;a href=&#34;https://environments.rstudio.com&#34; class=&#34;uri&#34;&gt;https://environments.rstudio.com&lt;/a&gt;. By adopting a strategy for reproducible
environments, R users, administrators, and teams can solve a number of important
challenges. Ultimately, reproducible work adds credibility, creating a solid
foundation for research, business applications, and production systems. We are
excited to be working on tools to make reproducible work in R easy and fun. We
look forward to your feedback, community discussions, and future posts.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/04/22/reproducible-environments/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Slack and Plumber, Part Two</title>
      <link>https://rviews.rstudio.com/2018/11/27/slack-and-plumber-part-two/</link>
      <pubDate>Tue, 27 Nov 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/11/27/slack-and-plumber-part-two/</guid>
      <description>
        


&lt;p&gt;This is the final entry in a three-part series about the &lt;a href=&#34;https://www.rplumber.io/&#34;&gt;&lt;code&gt;plumber&lt;/code&gt;&lt;/a&gt; package. &lt;a href=&#34;https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/&#34;&gt;The first post&lt;/a&gt; introduces &lt;code&gt;plumber&lt;/code&gt; as an R package for building REST API endpoints in R. &lt;a href=&#34;https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/&#34;&gt;The second post&lt;/a&gt; builds a working example of a &lt;code&gt;plumber&lt;/code&gt; API that powers a &lt;a href=&#34;https://api.slack.com/slash-commands&#34;&gt;Slack slash command&lt;/a&gt;. In this final entry, we will secure the API created in the previous post so that it only responds to authenticated requests, and deploy it using &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-11-20-blair-plumber-slack-part-two-files/plumber-slack-demo.gif&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;As a reminder, this API is built on top of simulated customer call data. The slash command we create will allow users to view a customer status report within Slack. This status report contains customer name, total calls, date of birth, and a plot of call history for the past 20 weeks. The simulated data, along with the script used to create it, can be found in the &lt;a href=&#34;https://github.com/sol-eng/plumber-slack&#34;&gt;GitHub repository&lt;/a&gt; for this example.&lt;/p&gt;
&lt;div id=&#34;setup&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;Successfully following this example assumes you have created a &lt;a href=&#34;https://slack.com&#34;&gt;Slack&lt;/a&gt; account and you have &lt;a href=&#34;https://api.slack.com/slack-apps&#34;&gt;followed the instructions for creating an app&lt;/a&gt;. The Plumber API as it currently exists is described in detail in the &lt;a href=&#34;https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This API can be run through the UI as previously described, or by running &lt;code&gt;plumber::plumb(&amp;quot;plumber.R&amp;quot;)$run(port = 5762)&lt;/code&gt; from the directory containing the API defined in &lt;code&gt;plumber.R&lt;/code&gt;. As it stands now, this API could be deployed and used by Slack. However, it’s important to remember that we have no control over the request that Slack makes to the API. Because of this, we can’t rely on RStudio Connect’s &lt;a href=&#34;http://docs.rstudio.com/connect/admin/content-management.html#api-keys&#34;&gt;built-in API authentication mechanism&lt;/a&gt; to secure the API because there is no way to submit a key with the request. Our options are either to expose the API with no security, meaning anyone can access the endpoints we’ve defined, or to find some other mechanism for securing the API so that it only responds to authorized requests.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;api-security-patterns&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;API Security Patterns&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.rplumber.io/docs/security.html&#34;&gt;&lt;code&gt;plumber&lt;/code&gt; documentation&lt;/a&gt; provides a good introduction to API security for the R user:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The majority of R programmers have not been trained to give much attention to the security of the code that they write. This is for good reason since running R code on your own machine with no external input gives little opportunity for attackers to leverage your R code to do anything malicious. However, as soon as you expose an API on a network, your concerns and thought process must adapt accordingly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;API security can be challenging to address. As it stands today, it is the developer’s responsibility to provide proper security on API endpoints, though in the future, there may be additional security features added to &lt;code&gt;plumber&lt;/code&gt; or available via other R packages.&lt;/p&gt;
&lt;p&gt;As mentioned in the &lt;a href=&#34;https://www.rplumber.io/docs/security.html&#34;&gt;&lt;code&gt;plumber&lt;/code&gt; documentation&lt;/a&gt;, there are a number of things to consider when designing API security. For example, if the API is deployed on an internal network, securing the API may not be as important as it would be if the API was publicly exposed on the internet. When an API needs to be secured, there are several potential attack vectors that need to be handled. In this specific example, we are exposing a public endpoint that provides access to sensitive customer data. If we are unable to authenticate incoming requests, then we risk exposing sensitive data. To prevent this data from falling into the wrong hands, we will focus on verifying incoming requests so that the API only responds to requests made from Slack.&lt;/p&gt;
&lt;p&gt;There are several different methods for authenticating requests made to API endpoints. One common method is the use of API keys, which are cryptographically secure values sent with the request to verify the identity of the client. However, in this case, we have no control over the request Slack sends, so we cannot include such a key in the request. Thankfully, Slack has provided an alternative authentication method using &lt;a href=&#34;https://api.slack.com/docs/verifying-requests-from-slack&#34;&gt;signed secrets&lt;/a&gt;. Full details can be read in the Slack documentation, but in essence, each Slack application is assigned a unique secret value that, when used in connection with other request details, can be used to verify that an incoming request is indeed coming from Slack and not an unknown third party.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;securing-the-api&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Securing the API&lt;/h2&gt;
&lt;p&gt;In order to secure our API so that only requests from Slack are honored, we first need to obtain the signing secret for our application. This value can be found in the Basic Information section of the Slack application settings. It is important to remember that this is called a signing secret for a reason: it should not be shared with anyone. To avoid exposing this secret, we can save it as an environment variable. We can set it in the current R session by using &lt;code&gt;Sys.setenv(SLACK_SIGNING_SECRET = &amp;lt;our signing secret&amp;gt;)&lt;/code&gt;, or we can add it to our &lt;a href=&#34;https://csgillespie.github.io/efficientR/set-up.html#renviron&#34;&gt;&lt;code&gt;.Renviron&lt;/code&gt;&lt;/a&gt; file so that it is set for every R session. Once this is done, we can access the value in R with &lt;code&gt;Sys.getenv(&amp;quot;SLACK_SIGNING_SECRET&amp;quot;)&lt;/code&gt;. Now we are ready to create a function that verifies incoming requests are from Slack.&lt;/p&gt;
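&lt;p&gt;For example, a single line in &lt;code&gt;.Renviron&lt;/code&gt; makes the secret available to every R session without hard-coding it in scripts that might end up in version control (the value shown is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# ~/.Renviron
SLACK_SIGNING_SECRET=your-signing-secret-here&lt;/code&gt;&lt;/pre&gt;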
&lt;p&gt;Slack provides the following three-step process for verifying requests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your app receives a request from Slack&lt;/li&gt;
&lt;li&gt;Your app computes a signature based on the request&lt;/li&gt;
&lt;li&gt;You make sure the computed signature matches the signature on the request&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In order to verify all incoming requests, we can define an additional filter for our API that follows the above recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Verify incoming requests
#* @filter verify
function(req, res) {
  # Forward requests coming to swagger endpoints
  if (grepl(&amp;quot;swagger&amp;quot;, tolower(req$PATH_INFO))) return(forward())

  # Reject requests that are missing the X_SLACK_REQUEST_TIMESTAMP header
  if (is.null(req$HTTP_X_SLACK_REQUEST_TIMESTAMP)) {
    res$status &amp;lt;- 401
    return(list(text = &amp;quot;Error: Invalid request&amp;quot;))
  }

  # Build base string
  base_string &amp;lt;- paste(
    &amp;quot;v0&amp;quot;,
    req$HTTP_X_SLACK_REQUEST_TIMESTAMP,
    req$postBody,
    sep = &amp;quot;:&amp;quot;
  )

  # Slack Signing secret is available as environment variable
  # SLACK_SIGNING_SECRET
  computed_request_signature &amp;lt;- paste0(
    &amp;quot;v0=&amp;quot;,
    openssl::sha256(base_string, Sys.getenv(&amp;quot;SLACK_SIGNING_SECRET&amp;quot;))
  )

  # If the computed request signature doesn&amp;#39;t match the signature provided in the
  # request, set status of response to 401
  if (!identical(req$HTTP_X_SLACK_SIGNATURE, computed_request_signature)) {
    res$status &amp;lt;- 401
  } else {
    res$status &amp;lt;- 200
  }

  if (res$status == 401) {
    list(
      text = &amp;quot;Error: Invalid request&amp;quot;
    )
  } else {
    forward()
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are a lot of moving pieces to this filter, but essentially we are following the process outlined by Slack for verifying requests. We also allow Swagger endpoints to be served without verification so that the Swagger UI can still be generated for our API.&lt;/p&gt;
&lt;p&gt;Once this filter is in place, all incoming requests will be verified. However, this will create issues with our &lt;code&gt;/plot/history/&lt;/code&gt; endpoint since it is called using a standard GET request without any Slack authentication. To keep this endpoint working as intended, we’ll make some small updates to it and add &lt;code&gt;#* @preempt verify&lt;/code&gt; to the &lt;code&gt;plumber&lt;/code&gt; comments before the function. This prevents the &lt;code&gt;verify&lt;/code&gt; filter from applying to the endpoint.&lt;/p&gt;
&lt;p&gt;Now, this prevents the Slack authentication process from applying to our plot endpoint. However, this endpoint, if left unsecured, provides unfiltered access to sensitive customer data. We need an effective way to secure this endpoint so that it only responds to requests generated from Slack.&lt;/p&gt;
&lt;p&gt;Since the only thing we control in the request to this endpoint is the URL, we can update our endpoint so that an encrypted parameter is passed as part of the URL. This parameter is a combination of the current datetime and the customer ID that is then encrypted using our Slack signing secret. We can use the &lt;code&gt;encrypt_string()&lt;/code&gt; function from the &lt;a href=&#34;https://talegari.github.io/safer/&#34;&gt;&lt;code&gt;safer&lt;/code&gt;&lt;/a&gt; package to securely encrypt this string. The following example illustrates this process.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;current_time &amp;lt;- Sys.time()
customer_id &amp;lt;- 89
parameter_string &amp;lt;- paste(current_time, customer_id, sep = &amp;quot;;&amp;quot;)
safer::encrypt_string(parameter_string, Sys.getenv(&amp;quot;SLACK_SIGNING_SECRET&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;m7NfMZfpY1n5EuivjuiFQsyKopT68HiX+NIgk5S+VBlDHrVqzRM=&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have created this encrypted value, we pass it to the URL of our plot endpoint. Then, within the plot endpoint, we decrypt the string, extract the customer ID, and check to see if the current time is within five seconds of the time encoded in the string. If more than five seconds have passed, we consider the request to be unauthorized. To help with this process, we define two helper functions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;encrypt_string &amp;lt;- function(string) {
  urltools::url_encode(safer::encrypt_string(paste(Sys.time(), string, sep = &amp;quot;;&amp;quot;),
                                             key = Sys.getenv(&amp;quot;SLACK_SIGNING_SECRET&amp;quot;)))
}

plot_auth &amp;lt;- function(endpoint, time_limit = 5) {
  # Save current time to compare against endpoint time value
  current_time &amp;lt;- Sys.time()

  # Try to decrypt endpoint and extract user id
  tryCatch({
    # Decrypt endpoint using SLACK_SIGNING_SECRET
    decrypted_endpoint &amp;lt;- safer::decrypt_string(endpoint,
                                                key = Sys.getenv(&amp;quot;SLACK_SIGNING_SECRET&amp;quot;))
    # Split endpoint on ;
    endpoint_split &amp;lt;- unlist(strsplit(decrypted_endpoint, split = &amp;quot;;&amp;quot;))
    # Convert time
    endpoint_time &amp;lt;- as.POSIXct(endpoint_split[1])
    # Calculate time difference
    time_diff &amp;lt;- difftime(current_time, endpoint_time, units = &amp;quot;secs&amp;quot;)

    # If more than time_limit seconds have passed since the request was
    # generated, then error
    if (time_diff &amp;gt; time_limit) {
      &amp;quot;Unauthorized&amp;quot;
    } else {
      endpoint_split[2]
    }
  },
  error = function(e) &amp;quot;Unauthorized&amp;quot;
  )
}&lt;/code&gt;&lt;/pre&gt;
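&lt;p&gt;To see how the expiry logic inside &lt;code&gt;plot_auth()&lt;/code&gt; behaves, we can exercise the timestamp check on its own. The sketch below skips the encryption step entirely (it only illustrates the split-and-compare portion), building the &lt;code&gt;datetime;id&lt;/code&gt; string and validating it against the five-second window:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Build the datetime;id parameter string (encryption omitted for illustration)
param &amp;lt;- paste(Sys.time(), 89, sep = &amp;quot;;&amp;quot;)

# Decryption would recover this same string; split it and check the timestamp
parts &amp;lt;- unlist(strsplit(param, split = &amp;quot;;&amp;quot;))
stamp &amp;lt;- as.POSIXct(parts[1])

# Within the five-second window, the customer ID comes back
if (difftime(Sys.time(), stamp, units = &amp;quot;secs&amp;quot;) &amp;gt; 5) &amp;quot;Unauthorized&amp;quot; else parts[2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;89&amp;quot;&lt;/code&gt;&lt;/pre&gt;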
&lt;p&gt;Once these helper functions are in place, we can update our &lt;code&gt;/plot/history&lt;/code&gt; endpoint as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Plot customer weekly calls
#* @png
#* @param cust_secret encrypted value calculated in /status endpoint
#* @response 400 No customer with the given ID was found.
#* @preempt verify
#* @get /plot/history
function(res, cust_secret) {
  # Authenticate that request came from /status
  cust_id &amp;lt;- plot_auth(cust_secret)

  # Return unauthorized error if cust_id is &amp;quot;Unauthorized&amp;quot;
  if (cust_id == &amp;quot;Unauthorized&amp;quot;) {
    res$status &amp;lt;- 401
    stop(&amp;quot;Unauthorized request&amp;quot;)
  } else if (!cust_id %in% sim_data$id) {
    res$status &amp;lt;- 400
    stop(&amp;quot;Customer id &amp;quot;, cust_id, &amp;quot; not found.&amp;quot;)
  }

  # Filter data to customer id provided
  plot_data &amp;lt;- dplyr::filter(sim_data, id == cust_id)

  # Customer name (id)
  customer_name &amp;lt;- paste0(unique(plot_data$name), &amp;quot; (&amp;quot;, unique(plot_data$id), &amp;quot;)&amp;quot;)

  # Create plot
  history_plot &amp;lt;- plot_data %&amp;gt;%
    ggplot(aes(x = time, y = calls, col = calls)) +
    ggalt::geom_lollipop(show.legend = FALSE) +
    theme_light() +
    labs(
      title = paste(&amp;quot;Weekly calls for&amp;quot;, customer_name),
      x = &amp;quot;Week&amp;quot;,
      y = &amp;quot;Calls&amp;quot;
    )

  # print() is necessary to render plot properly
  print(history_plot)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we need to make one small change to our &lt;code&gt;/status&lt;/code&gt; endpoint so that it builds the appropriate URL for our image. We construct the list response returned by the &lt;code&gt;/status&lt;/code&gt; endpoint as follows, where &lt;code&gt;image_url&lt;/code&gt; has been updated.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;    attachments = list(
      list(
        color = customer_status,
        title = paste0(&amp;quot;Status update for &amp;quot;, customer_name, &amp;quot; (&amp;quot;, customer_id, &amp;quot;)&amp;quot;),
        fallback = paste0(&amp;quot;Status update for &amp;quot;, customer_name, &amp;quot; (&amp;quot;, customer_id, &amp;quot;)&amp;quot;),
        # History plot

        image_url = paste0(base_url,
                           &amp;quot;/plot/history?cust_secret=&amp;quot;,
                           encrypt_string(customer_id)),
        # Fields provide a way of communicating semi-tabular data in Slack
        fields = list(
          list(
            title = &amp;quot;Total Calls&amp;quot;,
            value = sum(customer_data$calls),
            short = TRUE
          ),
          list(
            title = &amp;quot;DoB&amp;quot;,
            value = unique(customer_data$dob),
            short = TRUE
          )
        )
      )
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just like that, we have a secure API!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;all-together-now&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;All Together Now&lt;/h2&gt;
&lt;p&gt;Now, given the authorization pieces we have implemented, it is a bit more difficult to test our API since our endpoints will only respond to authorized requests. However, we can use the free version of &lt;a href=&#34;https://www.getpostman.com&#34;&gt;Postman&lt;/a&gt; to test our API. An in-depth look at the capabilities of Postman is beyond the scope of this post, so hopefully a gif will suffice. Further details about using Postman in this context can be found in the &lt;a href=&#34;https://github.com/sol-eng/plumber-slack#running-locally&#34;&gt;GitHub repository&lt;/a&gt; for this example.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-11-20-blair-plumber-slack-part-two-files/postman-demo.gif&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;It appears that everything is working as expected! Our endpoints fail when the authorization criteria are not met, and otherwise they succeed. Notice that the plot endpoint works when initially called, but when a subsequent call is made it fails since more than five seconds have passed since the &lt;code&gt;/status&lt;/code&gt; endpoint was invoked.&lt;/p&gt;
&lt;p&gt;Now, the final step in this process is publishing this API so that Slack can properly interact with it. The easiest way to do this is to publish the API to &lt;a href=&#34;http://docs.rstudio.com/connect/user/publishing.html#publishing-apis&#34;&gt;RStudio Connect&lt;/a&gt;. Once published, Slack can be updated to point the Slash command to our nice, newly secured API.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This brings us to the conclusion of this series. We’ve discovered the power of &lt;code&gt;plumber&lt;/code&gt; in exposing R to downstream consumers via RESTful API endpoints. We built a Slack app powered entirely by R and &lt;code&gt;plumber&lt;/code&gt;, and now we have secured the underlying API so that it only responds to authorized requests. As we have seen, &lt;code&gt;plumber&lt;/code&gt; provides a powerful and flexible framework for exposing R functions as APIs that can be safely secured.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;James Blair is a solutions engineer at RStudio who focuses on tools, technologies, and best practices for using R in the enterprise.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/11/27/slack-and-plumber-part-two/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Communicating results with R Markdown</title>
      <link>https://rviews.rstudio.com/2018/11/01/r-markdown-a-better-approach/</link>
      <pubDate>Thu, 01 Nov 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/11/01/r-markdown-a-better-approach/</guid>
      <description>
        


&lt;p&gt;&lt;img src=&#34;/post/2018-10-31-Stephens-Communicate_files/r4ds-com.png&#34; height = &#34;400&#34; width=&#34;600&#34;&gt;&lt;/p&gt;
&lt;p&gt;In my training as a consultant, I learned that long hours of analysis were typically followed by equally long hours of preparing for presentations. I had to turn my complex analyses into recommendations, and my success as a consultant depended on my ability to influence decision makers. I used a variety of tools to convey my insights, but over time I increasingly came to rely on &lt;a href=&#34;https://rmarkdown.rstudio.com/&#34;&gt;R Markdown&lt;/a&gt; as my tool of choice. R Markdown is easy to use, allows others to reproduce my work, and has powerful features such as parameterized inputs and multiple output formats. With R Markdown, I can share more work with less effort than I did with previous tools, making me a more effective data scientist. In this post, I want to examine three commonly used communication tools and show how R Markdown is often the better choice.&lt;/p&gt;
&lt;div id=&#34;microsoft-office&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Microsoft Office&lt;/h3&gt;
&lt;p&gt;The de facto tools for communication in the enterprise are still Microsoft Word, PowerPoint, and Excel. These tools, born in the ’80s and rising to prominence in the ’90s, are used everywhere for sharing reports, presentations, and dashboards. Although Microsoft Office documents are easy to share, they can be cumbersome for data scientists to write because they cannot be written with code. Additionally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They are not reproducible.&lt;/li&gt;
&lt;li&gt;They are separate from the code you used to create your analysis.&lt;/li&gt;
&lt;li&gt;They can be time-consuming to create and difficult to maintain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In data science, your code - not your report or presentation - is the source of your results. Therefore, your documents should also be based on code! You can accomplish this with R Markdown, which produces documents that are generated by code, reproducible, and easy to maintain. Moreover, R Markdown documents can be rendered in &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/word-document.html&#34;&gt;Word&lt;/a&gt;, &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html&#34;&gt;PowerPoint&lt;/a&gt;, and many other output formats. So, even if your client insists on having Microsoft documents, by generating them with R Markdown, you can spend more time working on your code and less time maintaining reports.&lt;/p&gt;
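&lt;p&gt;Switching output formats is a one-line change in the document’s YAML header. As a minimal sketch (the title here is made up), the same source file can target Word:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
title: &amp;quot;Quarterly Analysis&amp;quot;
output: word_document
---&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replacing &lt;code&gt;word_document&lt;/code&gt; with &lt;code&gt;powerpoint_presentation&lt;/code&gt; renders the same content as slides instead.&lt;/p&gt;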
&lt;/div&gt;
&lt;div id=&#34;r-scripts&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;R Scripts&lt;/h3&gt;
&lt;p&gt;Data science often involves interactive analyses with code, but code by itself is usually not enough to communicate results in an enterprise setting. In a &lt;a href=&#34;https://rviews.rstudio.com/2017/03/15/why-i-love-r-notebooks/&#34;&gt;previous post&lt;/a&gt;, I explained the benefits of using &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/notebook.html&#34;&gt;R Notebooks&lt;/a&gt; over R scripts for doing data science. An R Notebook is a special execution mode of R Markdown with two characteristics that make it very useful for communicating results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rendering a preview of an R Notebook does not execute R code, making it computationally convenient to create reports during or after interactive analyses.&lt;/li&gt;
&lt;li&gt;R Notebooks have an embedded copy of the source code, making it convenient for others to examine your work.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two characteristics of R Notebooks combine the advantages of R scripts with the advantages of R Markdown. Like R scripts, you can do interactive data analyses and see all your code, but unlike R scripts you can easily create reports that explain why your code is important.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;shiny&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Shiny&lt;/h3&gt;
&lt;p&gt;Shiny and R Markdown are both used to communicate results. They both depend on R, generate high-quality output, and can be designed to accept user inputs. In previous posts, we discussed &lt;a href=&#34;https://rviews.rstudio.com/2017/09/20/dashboards-with-r-and-databases/&#34;&gt;Dashboards with Shiny&lt;/a&gt; and &lt;a href=&#34;https://rviews.rstudio.com/2018/05/16/replacing-excel-reports-with-r-markdown-and-shiny/&#34;&gt;Dashboards with R Markdown&lt;/a&gt;. Knowing when to use Shiny and when to use R Markdown will increase your ability to influence decision makers.&lt;/p&gt;
&lt;table style=&#34;width:44%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;20%&#34; /&gt;
&lt;col width=&#34;23%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Shiny Apps&lt;/th&gt;
&lt;th&gt;R Markdown Documents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;Have an interactive and responsive user experience.&lt;/td&gt;
&lt;td&gt;Are snapshots in time, rendered in batch.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;Are hosted on a web server that runs R.&lt;/td&gt;
&lt;td&gt;Have multiple output types such as HTML, Word, PDF, and many more.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;Are not portable (i.e., users must visit the app).&lt;/td&gt;
&lt;td&gt;Are files that can be sent via email or otherwise shared.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Shiny is great – even “magical” – when you want your end users to have an interactive experience, but R Markdown documents are often simpler to program, easier to maintain, and can reach a wider audience. I use Shiny when I need an interactive user experience, but for everything else, I use R Markdown.&lt;/p&gt;
&lt;p&gt;If you need to accept user input, but you don’t require the reactive framework of Shiny, you can &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/parameterized-reports.html&#34;&gt;add parameters&lt;/a&gt; to your R Markdown code. This &lt;a href=&#34;https://resources.rstudio.com/rstudio-connect-2/parameterized-r-markdown-reports-with-rstudio-connect-aron-atkins&#34;&gt;process is easy and powerful&lt;/a&gt;, yet remains underutilized by most R users. It is a feature that would benefit a wide range of use cases, especially where the full power of Shiny is not required. Additionally, adding parameters to your document makes it easy to generate multiple versions of that document. If you host a document on &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;, then users can select inputs and generate new versions on demand. Many Shiny applications today would be better suited as parameterized R Markdown documents.&lt;/p&gt;
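&lt;p&gt;As an illustration (the parameter and file names are made up), a parameterized report declares its inputs in the YAML header:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
title: &amp;quot;Regional Sales&amp;quot;
output: html_document
params:
  region: &amp;quot;East&amp;quot;
---&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Code chunks can then refer to &lt;code&gt;params$region&lt;/code&gt;, and a call such as &lt;code&gt;rmarkdown::render(&amp;quot;report.Rmd&amp;quot;, params = list(region = &amp;quot;West&amp;quot;))&lt;/code&gt; generates a new version of the document in batch.&lt;/p&gt;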
&lt;p&gt;Finally, Shiny and R Markdown are not mutually exclusive. You can include Shiny elements in an R Markdown document, which enables you to create a report that responds interactively to user inputs. These &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/shiny-documents.html&#34;&gt;Shiny documents&lt;/a&gt; are created with the simplicity of R Markdown, but have the same hosting requirements as a Shiny app and are not portable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Using the right tools for communication matters. R Markdown is a better solution than conventional tools for the following problems:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Common tool&lt;/th&gt;
&lt;th&gt;Better tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;Share reports and presentations&lt;/td&gt;
&lt;td&gt;Microsoft Office&lt;/td&gt;
&lt;td&gt;R Markdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;Summarize and share your interactive analyses&lt;/td&gt;
&lt;td&gt;R Scripts&lt;/td&gt;
&lt;td&gt;R Notebooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;Update results (in batch) based on new inputs&lt;/td&gt;
&lt;td&gt;Shiny&lt;/td&gt;
&lt;td&gt;Parameterized reports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href=&#34;http://r4ds.had.co.nz/index.html&#34;&gt;R For Data Science&lt;/a&gt; explains that, &lt;em&gt;“It doesn’t matter how great your analysis is unless you can explain it to others: you need to communicate your results.”&lt;/em&gt; I highly recommend reading &lt;a href=&#34;https://r4ds.had.co.nz/communicate-intro.html&#34;&gt;Part V&lt;/a&gt; of this book, which has chapters on using &lt;a href=&#34;https://r4ds.had.co.nz/r-markdown.html&#34;&gt;R Markdown&lt;/a&gt; as a unified authoring framework for data science, using &lt;a href=&#34;https://r4ds.had.co.nz/r-markdown-formats.html&#34;&gt;R Markdown formats&lt;/a&gt; for effective communication, and using &lt;a href=&#34;https://r4ds.had.co.nz/r-markdown-workflow.html&#34;&gt;R Markdown workflows&lt;/a&gt; to create analysis notebooks. There are references at the end of these chapters that describe where to learn more about communication.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/11/01/r-markdown-a-better-approach/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Interactive plots in Shiny</title>
      <link>https://rviews.rstudio.com/2018/09/20/shiny-r2d3/</link>
      <pubDate>Thu, 20 Sep 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/09/20/shiny-r2d3/</guid>
      <description>
        


&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-09-17-shiny-r2d3/header.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;I wish this post existed when I was struggling to add interactive plots to my Shiny app. I was mainly focused on recreating functionality found in other “dashboarding” applications. When looking for options, I found that &lt;a href=&#34;https://www.htmlwidgets.org/&#34;&gt;htmlwidgets&lt;/a&gt; were the closest to what companies usually expect. However, while they are great for client-side interactivity, I often hit walls when I try to add click-through interactivity because the functionality is either missing, very limited, or bloated. With &lt;code&gt;r2d3&lt;/code&gt; there is more work, but the gains in customization and interactivity make it by far the best choice, in my opinion.&lt;/p&gt;
&lt;p&gt;I asked a good friend at work to help me test the sample app provided in this post. She was able to run it easily, but then told me that she didn’t know that she was supposed to click on things. Adding interactive plots is one of the most important capabilities to include in a Shiny app. Sadly though, it seems that very few do it. If we wish to offer an alternative to enterprise reporting and BI tools by using Shiny, we need to do our best to match the interactivity those other tools seem to offer out of the box.&lt;/p&gt;
&lt;div id=&#34;the-sample-app&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The sample app&lt;/h2&gt;
&lt;p&gt;I put together a sample app that should run in your R session by simply copying the code. This will allow us to focus on the details of the approach, and not on the setup.&lt;/p&gt;
&lt;p&gt;A working version of the app is available here: &lt;a href=&#34;https://beta.rstudioconnect.com/content/3940/&#34;&gt;Shiny-r2d3-app&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In this app, we can click on the bars and see the &lt;code&gt;DT&lt;/code&gt; object update based on the value of the bar. When the drop-down changes, the plot will update with a nice transition, as well.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;d3-is-hard&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;“D3 is hard”&lt;/h2&gt;
&lt;p&gt;The title is a quote from a luminary in the R community. A few months ago, I told him that I wanted to start using &lt;code&gt;r2d3&lt;/code&gt; but was struggling with making heads or tails of D3. This person has forgotten more than I will ever learn about pretty much any subject. If he says it’s hard, then I’m in for a world of hurt. Nevertheless, my naivete and stubbornness prevailed.&lt;/p&gt;
&lt;p&gt;I’ve since discovered that D3 is a JavaScript library with which the desired result can be obtained through one of several coding approaches. The more I learn to use it, the more I like its flexibility as a stand-alone visualization tool.&lt;/p&gt;
&lt;p&gt;One thing that helped was realizing that D3 and &lt;code&gt;ggplot2&lt;/code&gt; offer a similar amount of flexibility. Picture the bars of a bar plot as the actual rectangles you draw, almost as if you were using &lt;code&gt;geom_rect()&lt;/code&gt;. Except that in D3, the 0,0 coordinate is the top-left corner, as opposed to the bottom-left, so we have to flip our thinking upside down when we create a visualization with D3. In addition, the vertical and horizontal positions and sizes are expressed as fractions of the canvas size, so there are no absolute positions.&lt;/p&gt;
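&lt;p&gt;As a quick illustration of the flipped coordinate system (the numbers are made up and unrelated to the app below): to draw a vertical bar of height 30 on a 100-pixel-tall canvas, the rectangle’s &lt;code&gt;y&lt;/code&gt; attribute is its distance from the &lt;em&gt;top&lt;/em&gt; of the canvas, so we subtract from the canvas height:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svg_height &amp;lt;- 100  # canvas height in pixels (hypothetical)
bar_height &amp;lt;- 30

# In ggplot2 terms the bar rises from y = 0; in D3, y is measured from the top
y_attr &amp;lt;- svg_height - bar_height
y_attr&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 70&lt;/code&gt;&lt;/pre&gt;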
&lt;/div&gt;
&lt;div id=&#34;a-good-way-to-start&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A good way to start&lt;/h2&gt;
&lt;p&gt;After trying out several approaches, I think that a good way to start is by having a few “primer” D3 scripts that can be modified to suit a particular app.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;r2d3&lt;/code&gt; calls a D3 script with a &lt;code&gt;.js&lt;/code&gt; extension. As a result, the D3 code sits outside the R script, away from view. With &lt;code&gt;r2d3&lt;/code&gt;, a &lt;code&gt;data.frame&lt;/code&gt; can be used to pass all sorts of attributes (x/y coordinates, colors, etc.) to D3.&lt;/p&gt;
&lt;p&gt;A good way of thinking about these “primers” is that you are building your own &lt;code&gt;geom&lt;/code&gt;s as &lt;code&gt;.js&lt;/code&gt; scripts. So, once it’s done, you can pass the regular “right-side-up” coordinate data to &lt;code&gt;r2d3&lt;/code&gt; and it will know how to calculate the proper offsets to place the shapes in the correct spot.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-first-primer&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A first primer&lt;/h2&gt;
&lt;p&gt;The idea in this section is to provide the smallest possible example that covers what I feel are the most important pieces that make up a presentable and functional product. My hope is that, if you find this interesting and useful for your line of work, you will take your time to dissect what each code section does, to learn the principles of this approach. This way, you can customize and even expand on the primer.&lt;/p&gt;
&lt;p&gt;The first example below is not the full primer. Instead, it is the section where most of the nuances of how the primer works exist. I’ll use that to explain some of the mechanics.&lt;/p&gt;
&lt;p&gt;You can copy-paste the following code in your R session and run it without worrying about file dependencies. I know how important that is when learning new things, so instead of providing &lt;code&gt;r2d3&lt;/code&gt; a separate &lt;code&gt;.js&lt;/code&gt; file, I use a small workaround: the D3 script is kept in a character variable whose contents are saved to a temporary file. This is probably not something that you’ll do in a final Shiny app, but it works well for this example. Based on how the R Views code highlighter is set up, all of the D3 code will be in red, and the R code mostly in black:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(shiny)
library(dplyr)
library(r2d3)
library(forcats)

# D3 code inside an R character variable
r2d3_script &amp;lt;- &amp;quot;
// !preview r2d3 data= data.frame(y = 0.1, ylabel = &amp;#39;1%&amp;#39;, fill = &amp;#39;#E69F00&amp;#39;, mouseover = &amp;#39;green&amp;#39;, label = &amp;#39;one&amp;#39;, id = 1)
function svg_height() {return parseInt(svg.style(&amp;#39;height&amp;#39;))}
function svg_width()  {return parseInt(svg.style(&amp;#39;width&amp;#39;))}
function col_top()  {return svg_height() * 0.05; }
function col_left() {return svg_width()  * 0.20; }
function actual_max() {return d3.max(data, function (d) {return d.y; }); }
function col_width()  {return (svg_width() / actual_max()) * 0.55; }
function col_height() {return svg_height() / data.length * 0.95; }
var bars = svg.selectAll(&amp;#39;rect&amp;#39;).data(data);
bars.enter().append(&amp;#39;rect&amp;#39;)
    .attr(&amp;#39;x&amp;#39;,      col_left())
    .attr(&amp;#39;y&amp;#39;,      function(d, i) { return i * col_height() + col_top(); })
    .attr(&amp;#39;width&amp;#39;,  function(d) { return d.y * col_width(); })
    .attr(&amp;#39;height&amp;#39;, col_height() * 0.9)
    .attr(&amp;#39;fill&amp;#39;,   function(d) {return d.fill; })
    .attr(&amp;#39;id&amp;#39;,     function(d) {return (d.label); })
    .on(&amp;#39;click&amp;#39;, function(){
      Shiny.setInputValue(&amp;#39;bar_clicked&amp;#39;, d3.select(this).attr(&amp;#39;id&amp;#39;), {priority: &amp;#39;event&amp;#39;});
    })
    .on(&amp;#39;mouseover&amp;#39;, function(){
      d3.select(this).attr(&amp;#39;fill&amp;#39;, function(d) {return d.mouseover; });
    })
    .on(&amp;#39;mouseout&amp;#39;, function(){
      d3.select(this).attr(&amp;#39;fill&amp;#39;, function(d) {return d.fill; });
    });
&amp;quot;
# Save D3 code into a tempfile
r2d3_file &amp;lt;- tempfile()
writeLines(r2d3_script, r2d3_file)

# Shiny app starts here
ui &amp;lt;- fluidPage(
    d3Output(&amp;quot;d3&amp;quot;)
)

server &amp;lt;- function(input, output, session) {
    output$d3 &amp;lt;- renderD3({
        gss_cat %&amp;gt;%
            group_by(marital) %&amp;gt;%
            tally() %&amp;gt;%
            arrange(desc(n)) %&amp;gt;%
            mutate(
                y = n,
                ylabel = prettyNum(n, big.mark = &amp;quot;,&amp;quot;),
                fill = &amp;quot;#E69F00&amp;quot;,
                mouseover = &amp;quot;#0072B2&amp;quot;
            ) %&amp;gt;%
            r2d3(r2d3_file)
            # ^^ Use the temp file containing the D3 code
    })}

shinyApp(ui = ui, server = server)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result should look like the screenshot below. In your R session, hovering over a bar will change its color. Also notice that the bars do not cover the entire window. This is because of the size ratios built into the functions defined at the top of the script.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-09-17-shiny-r2d3/first.png&#34; /&gt;

&lt;/div&gt;
&lt;div id=&#34;code-breakdown&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code breakdown&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;First, is the D3 code:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;I start by defining some canvas-sizing functions, beginning with: &lt;code&gt;function svg_height() {return parseInt(svg.style(&#39;height&#39;))}&lt;/code&gt;. These allow for the correct relative placement and size, as well as adapting to a window resize. For example: &lt;code&gt;function actual_max() {return d3.max(data, function (d) {return d.y; }); }&lt;/code&gt; obtains the value of the longest bar, and then: &lt;code&gt;function col_width()  {return (svg_width() / actual_max()) * 0.55; }&lt;/code&gt; makes sure that the largest rectangle (representing a bar) drawn is 55% of the window’s width. I used to define these as regular D3 variables, but found that as functions, they worked more consistently when running with Shiny.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With &lt;code&gt;var bars = svg.selectAll(&#39;rect&#39;).data(data);&lt;/code&gt;, we create a new rectangle - better said, a new rectangle set. Just like with &lt;code&gt;geom_rect()&lt;/code&gt;, if you pass a vector with multiple values, it will create multiple rectangles. The last function, &lt;code&gt;data()&lt;/code&gt;, tells D3 to use the &lt;code&gt;data&lt;/code&gt; data set, which is the default name that &lt;code&gt;r2d3&lt;/code&gt; is using when it translates our &lt;code&gt;data.frame&lt;/code&gt; to a D3-friendly format. This is the “secret sauce” that allows us to use that data as attributes of the plot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The rectangles are initially drawn with: &lt;code&gt;bars.enter().append(&#39;rect&#39;)&lt;/code&gt;. This will work fine as long as nothing changes. But with Shiny, we want change, so in a later section, I will introduce the &lt;code&gt;bars.transition()&lt;/code&gt; function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next are the attributes (&lt;code&gt;.attr&lt;/code&gt;). Attributes are interesting in these kinds of objects. They are all named with a character value (&lt;code&gt;x&lt;/code&gt;, &lt;code&gt;fill&lt;/code&gt;, etc.), so naming is essentially free-form. Each type of D3 shape has its own set of expected attributes, such as &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;width&lt;/code&gt;, but I can also pass a “made-up” attribute and the script will not fail. If you pass an attribute with a “reserved” name for the shape, it will be used; for example, &lt;code&gt;r&lt;/code&gt; is the attribute for the radius of a D3 circle. But if the attribute does not exist, it just becomes metadata that we can use later on if we want. This comes in handy if we want an ID field to be passed to Shiny without displaying it in the plot. The downside is that a misspelled attribute fails silently, which makes debugging a bit difficult, so make sure that your attributes are spelled correctly! In our example, defining &lt;code&gt;x&lt;/code&gt; is easy because we want it to be as far to the left as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most attributes are set based on data passed via &lt;code&gt;r2d3&lt;/code&gt;. We do that by wrapping the value of the attribute inside a function. We already told D3 where the data comes from, so it is implied that in &lt;code&gt;function(d)&lt;/code&gt; the data object will be represented by &lt;code&gt;d&lt;/code&gt;. Another interesting thing about these functions is the second argument, usually represented by &lt;code&gt;i&lt;/code&gt;. It represents the “row number” of the observation. This means that a function like &lt;code&gt;function(d, i) { return d.x * i}&lt;/code&gt; will give the attribute the value of the &lt;code&gt;x&lt;/code&gt; variable of the &lt;code&gt;data.frame&lt;/code&gt; we passed to &lt;code&gt;r2d3&lt;/code&gt;, times the row number. So &lt;code&gt;.attr(&#39;fill&#39;,   function(d) {return d.fill; })&lt;/code&gt; simply passes the &lt;code&gt;fill&lt;/code&gt; value of our &lt;code&gt;data.frame&lt;/code&gt; to D3. Notice that we can name these fields whatever we want; we just need to map them appropriately. With a primer, I found that it’s better to keep either matching (or at the very least, generic) names so we can use them for other plots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;on()&lt;/code&gt; functions track named events, such as &lt;code&gt;click&lt;/code&gt;, &lt;code&gt;mouseover&lt;/code&gt;, and &lt;code&gt;mouseout&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;click&lt;/code&gt; function will use a Shiny JavaScript function that makes the interaction possible. In &lt;code&gt;Shiny.setInputValue(&#39;bar_clicked&#39;, d3.select(this).attr(&#39;id&#39;), {priority: &#39;event&#39;});&lt;/code&gt;, I specify the name of the input inside Shiny, so &lt;code&gt;bar_clicked&lt;/code&gt; becomes &lt;code&gt;input$bar_clicked&lt;/code&gt; in R. The attribute &lt;code&gt;id&lt;/code&gt; is the value passed to R via that input. This is only a brief introduction to the topic; a much more detailed explanation with illustrations can be found in the &lt;a href=&#34;https://rstudio.github.io/r2d3/articles/shiny.html#d3-to-shiny&#34;&gt;&lt;code&gt;r2d3&lt;/code&gt; site&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;mouseover&lt;/code&gt; and &lt;code&gt;mouseout&lt;/code&gt; events are used to get the color-changing, hover-over effect. On &lt;code&gt;mouseover&lt;/code&gt;, the &lt;code&gt;fill&lt;/code&gt; attribute is updated to use the highlighting color and then restore it to the original color when the pointer leaves with &lt;code&gt;mouseout&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For the R/Shiny code:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;As mentioned above, using &lt;code&gt;r2d3_file &amp;lt;- tempfile()&lt;/code&gt; and then &lt;code&gt;writeLines(r2d3_script, r2d3_file)&lt;/code&gt; is done to keep the D3 and R code in one location. This allows you to copy and run the script without worrying about dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;r2d3&lt;/code&gt; includes functions to interact with Shiny. The &lt;code&gt;d3Output()&lt;/code&gt; function is used in the &lt;code&gt;ui&lt;/code&gt; section of the app, and &lt;code&gt;renderD3()&lt;/code&gt; is used in the &lt;code&gt;server&lt;/code&gt; section of the app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using &lt;code&gt;dplyr&lt;/code&gt;, the &lt;code&gt;forcats::gss_cat&lt;/code&gt; data is transformed to fit what the primer expects. In other words, the variable containing the total count obtained with &lt;code&gt;tally()&lt;/code&gt; is renamed to &lt;code&gt;y&lt;/code&gt;. Additionally, new fields are added to specify the colors. A note about colors in D3: you can pass color names (“red”) or hex codes (“#E69F00”). Some additional tips for hex color selection can be found in the &lt;a href=&#34;http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette&#34;&gt;&lt;code&gt;ggplot2&lt;/code&gt; cookbook&lt;/a&gt;. A very nice application for testing different color schemes and exploring contrast with different color deficiencies is available &lt;a href=&#34;http://projects.susielu.com/viz-palette&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thanks to the fact that the &lt;code&gt;r2d3()&lt;/code&gt; function uses the data as its first argument, we can simply pipe (&lt;code&gt;%&amp;gt;%&lt;/code&gt;) the &lt;code&gt;dplyr&lt;/code&gt; transformations directly to it. The only argument to pass to &lt;code&gt;r2d3()&lt;/code&gt; is the location of the new temporary file.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-full-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The full example&lt;/h2&gt;
&lt;p&gt;Here is the full code for the sample app linked above. The D3 script is what I would consider a more complete “primer” that you can use in other apps. Copy and run the code to try out the Shiny app; as mentioned before, it should run without having to worry about any other file dependencies. More explanation and code breakdown is available after this code section:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(shiny)
library(dplyr)
library(r2d3)
library(forcats)
library(DT)
library(rlang)

r2d3_script &amp;lt;- &amp;quot;
// !preview r2d3 data= data.frame(y = 0.1, ylabel = &amp;#39;1%&amp;#39;, fill = &amp;#39;#E69F00&amp;#39;, mouseover = &amp;#39;green&amp;#39;, label = &amp;#39;one&amp;#39;, id = 1)
function svg_height() {return parseInt(svg.style(&amp;#39;height&amp;#39;))}
function svg_width()  {return parseInt(svg.style(&amp;#39;width&amp;#39;))}
function col_top()  {return svg_height() * 0.05; }
function col_left() {return svg_width()  * 0.20; }
function actual_max() {return d3.max(data, function (d) {return d.y; }); }
function col_width()  {return (svg_width() / actual_max()) * 0.55; }
function col_heigth() {return svg_height() / data.length * 0.95; }

var bars = svg.selectAll(&amp;#39;rect&amp;#39;).data(data);
bars.enter().append(&amp;#39;rect&amp;#39;)
    .attr(&amp;#39;x&amp;#39;,      col_left())
    .attr(&amp;#39;y&amp;#39;,      function(d, i) { return i * col_heigth() + col_top(); })
    .attr(&amp;#39;width&amp;#39;,  function(d) { return d.y * col_width(); })
    .attr(&amp;#39;height&amp;#39;, col_heigth() * 0.9)
    .attr(&amp;#39;fill&amp;#39;,   function(d) {return d.fill; })
    .attr(&amp;#39;id&amp;#39;,     function(d) {return (d.label); })
    .on(&amp;#39;click&amp;#39;, function(){
      Shiny.setInputValue(&amp;#39;bar_clicked&amp;#39;, d3.select(this).attr(&amp;#39;id&amp;#39;), {priority: &amp;#39;event&amp;#39;});
    })
    .on(&amp;#39;mouseover&amp;#39;, function(){
      d3.select(this).attr(&amp;#39;fill&amp;#39;, function(d) {return d.mouseover; });
    })
    .on(&amp;#39;mouseout&amp;#39;, function(){
      d3.select(this).attr(&amp;#39;fill&amp;#39;, function(d) {return d.fill; });
    });
bars.transition()
  .duration(500)
    .attr(&amp;#39;x&amp;#39;,      col_left())
    .attr(&amp;#39;y&amp;#39;,      function(d, i) { return i * col_heigth() + col_top(); })
    .attr(&amp;#39;width&amp;#39;,  function(d) { return d.y * col_width(); })
    .attr(&amp;#39;height&amp;#39;, col_heigth() * 0.9)
    .attr(&amp;#39;fill&amp;#39;,   function(d) {return d.fill; })
    .attr(&amp;#39;id&amp;#39;,     function(d) {return d.label; });
bars.exit().remove();

// Identity labels
var txt = svg.selectAll(&amp;#39;text&amp;#39;).data(data);
txt.enter().append(&amp;#39;text&amp;#39;)
    .attr(&amp;#39;x&amp;#39;, width * 0.01)
    .attr(&amp;#39;y&amp;#39;, function(d, i) { return i * col_heigth() + (col_heigth() / 2) + col_top(); })
    .text(function(d) {return d.label; })
    .style(&amp;#39;font-family&amp;#39;, &amp;#39;sans-serif&amp;#39;);
txt.transition()
    .duration(1000)
    .attr(&amp;#39;x&amp;#39;, width * 0.01)
    .attr(&amp;#39;y&amp;#39;, function(d, i) { return i * col_heigth() + (col_heigth() / 2) + col_top(); })
    .text(function(d) {return d.label; });
txt.exit().remove();

// Numeric labels
var totals = svg.selectAll().data(data);
totals.enter().append(&amp;#39;text&amp;#39;)
    .attr(&amp;#39;x&amp;#39;, function(d) { return ((d.y * col_width()) + col_left()) * 1.01; })
    .attr(&amp;#39;y&amp;#39;, function(d, i) { return i * col_heigth() + (col_heigth() / 2) + col_top(); })
    .style(&amp;#39;font-family&amp;#39;, &amp;#39;sans-serif&amp;#39;)
    .text(function(d) {return d.ylabel; });
totals.transition()
    .duration(1000)
    .attr(&amp;#39;x&amp;#39;, function(d) { return ((d.y * col_width()) + col_left()) * 1.01; })
    .attr(&amp;#39;y&amp;#39;, function(d, i) { return i * col_heigth() + (col_heigth() / 2) + col_top(); })
    .attr(&amp;#39;d&amp;#39;, function(d) { return d.x; })
    .text(function(d) {return d.ylabel; });
totals.exit().remove();
&amp;quot;
r2d3_file &amp;lt;- tempfile()
writeLines(r2d3_script, r2d3_file)

ui &amp;lt;- fluidPage(
  selectInput(&amp;quot;var&amp;quot;, &amp;quot;Variable&amp;quot;,
              list(&amp;quot;marital&amp;quot;, &amp;quot;rincome&amp;quot;, &amp;quot;partyid&amp;quot;, &amp;quot;relig&amp;quot;, &amp;quot;denom&amp;quot;),
              selected = &amp;quot;marital&amp;quot;),
  d3Output(&amp;quot;d3&amp;quot;),
  DT::dataTableOutput(&amp;quot;table&amp;quot;),
  textInput(&amp;quot;val&amp;quot;, &amp;quot;Value&amp;quot;, &amp;quot;Married&amp;quot;)
)

server &amp;lt;- function(input, output, session) {
  output$d3 &amp;lt;- renderD3({
    gss_cat %&amp;gt;%
      mutate(label = !!sym(input$var)) %&amp;gt;%
      group_by(label) %&amp;gt;%
      tally() %&amp;gt;%
      arrange(desc(n)) %&amp;gt;%
      mutate(
        y = n,
        ylabel = prettyNum(n, big.mark = &amp;quot;,&amp;quot;),
        fill = ifelse(label != input$val, &amp;quot;#E69F00&amp;quot;, &amp;quot;red&amp;quot;),
        mouseover = &amp;quot;#0072B2&amp;quot;
      ) %&amp;gt;%
      r2d3(r2d3_file)
  })
  observeEvent(input$bar_clicked, {
      updateTextInput(session, &amp;quot;val&amp;quot;, value = input$bar_clicked)
  })
  output$table &amp;lt;- renderDataTable({
    gss_cat %&amp;gt;%
      filter(!!sym(input$var) == input$val) %&amp;gt;%
      datatable()
  })
}

shinyApp(ui = ui, server = server)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;additions-to-d3-code&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Additions to D3 code&lt;/h3&gt;
&lt;p&gt;Hopefully, you can see a coding pattern emerging in the more lengthy example above. Here are some explanations for items that are new or outside the pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;bars.transition()&lt;/code&gt; function “re-draws” the shape or text when the underlying data changes, such as when we make a change within the Shiny app. The &lt;code&gt;duration()&lt;/code&gt; function defines how long the changes take. Be sure to copy all of the attributes from the &lt;code&gt;enter()&lt;/code&gt; function; this is needed when adding D3 plots to a Shiny app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;var txt = svg.selectAll(&#39;text&#39;).data(data);&lt;/code&gt; code adds a new text object, similar to &lt;code&gt;geom_text()&lt;/code&gt;. The same coding pattern as the &lt;code&gt;rect&lt;/code&gt; shape applies. The additions are: a &lt;code&gt;text()&lt;/code&gt; function that defines what is displayed on screen (note that there’s no &lt;code&gt;attr(&#39;text&#39;,...&lt;/code&gt;), and the &lt;code&gt;style()&lt;/code&gt; function that allows setting the font type and size.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up-the-shiny-interactivity&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting up the Shiny interactivity&lt;/h3&gt;
&lt;p&gt;There are three options to integrate the Shiny input created inside the D3 script:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Have a given Shiny &lt;code&gt;output&lt;/code&gt; react to the D3/Shiny input. An example would be to use it as a value to filter data in &lt;code&gt;filter(id_field == input$bar_clicked)&lt;/code&gt;. This works OK when there are not too many plots to integrate, but for a large dashboard, the second option would be better. An example of this approach can be found &lt;a href=&#34;https://rstudio.github.io/r2d3/articles/shiny.html#shiny-code&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use Shiny’s &lt;code&gt;observeEvent()&lt;/code&gt; to monitor the D3/Shiny input and have it run a specific action based on the value of the input. I usually use this approach to update another Shiny input in the app, and that is the approach used in this app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the &lt;code&gt;reactive()&lt;/code&gt; function to wrap all of the data transformations that are common across all of the plots inside the dashboard. Then have each plot use that function as the base of further &lt;code&gt;dplyr&lt;/code&gt; transformations. That approach can be found in the &lt;a href=&#34;http://db.rstudio.com/best-practices/dashboards/&#34;&gt;Enterprise Dashboards article&lt;/a&gt; on db.rstudio.com; here is &lt;a href=&#34;https://github.com/sol-eng/db-dashboard/blob/f3f42eabe722207510e6670ad81e36722e0b3d44/local_app.R#L91-L116&#34;&gt;a direct link to the code&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
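&lt;p&gt;As a brief sketch of the third option (not part of the app above; the &lt;code&gt;base_data&lt;/code&gt; name is hypothetical), the shared transformation is wrapped in &lt;code&gt;reactive()&lt;/code&gt; and each output builds on it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;server &amp;lt;- function(input, output, session) {
  # Shared base transformation, recomputed only when input$var changes
  base_data &amp;lt;- reactive({
    gss_cat %&amp;gt;%
      group_by(label = !!sym(input$var)) %&amp;gt;%
      tally()
  })
  # Each output starts from the shared reactive
  output$d3 &amp;lt;- renderD3({
    base_data() %&amp;gt;%
      mutate(y = n) %&amp;gt;%
      r2d3(r2d3_file)
  })
  output$table &amp;lt;- renderDataTable({
    base_data() %&amp;gt;% datatable()
  })
}&lt;/code&gt;&lt;/pre&gt;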
&lt;/div&gt;
&lt;div id=&#34;other-r-additions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other R additions&lt;/h3&gt;
&lt;p&gt;A few additional tips that are helpful, but not mandatory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;To get the effect of keeping the selected bar with a different color than the others, I used an &lt;code&gt;ifelse()&lt;/code&gt; inside the &lt;code&gt;mutate()&lt;/code&gt; that checks if a particular row matches to the selected &lt;code&gt;input&lt;/code&gt;: &lt;code&gt;fill = ifelse(label != input$val, &amp;quot;#E69F00&amp;quot;, &amp;quot;red&amp;quot;)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In this line: &lt;code&gt;mutate(label = !!sym(input$var))&lt;/code&gt;, I am using &lt;code&gt;rlang&lt;/code&gt;’s tidy evaluation to allow the plot to change the field it displays. This is a very rare requirement in an app, so I hope that it doesn’t throw anyone off; it is an advanced R programming concept that is not necessary for D3/Shiny.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I decided to use separate fields for the total count (&lt;code&gt;y&lt;/code&gt;) and for the label shown on that bar (&lt;code&gt;ylabel&lt;/code&gt;). It was easier for me to handle the formatting in R than in D3; some may decide to do that in the D3 script instead.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;rstudio-1.2&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;RStudio 1.2&lt;/h2&gt;
&lt;p&gt;If you have the RStudio IDE Preview Release installed, you can easily preview the D3 visualization right in the Viewer pane. Information on how to do this is &lt;a href=&#34;https://rstudio.github.io/r2d3/#d3-preview&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the first line of the script above, there is a D3 comment containing metadata that RStudio passes to &lt;code&gt;r2d3&lt;/code&gt; so that you do not have to run R code in the console to see a preview. This integration also lets us use the IDE to edit the D3 file, which accelerates learning D3.&lt;/p&gt;
&lt;p&gt;To try this out with the visualization above, copy and paste the contents of the &lt;code&gt;r2d3_script&lt;/code&gt; variable to a new D3 file inside the RStudio IDE.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;closing-words&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Closing words&lt;/h2&gt;
&lt;p&gt;Thank you for making it this far! Even if you were just skimming, I hope one or two things I’ve shown were interesting enough to consider trying out the exercise.&lt;/p&gt;
&lt;p&gt;Sometimes, we forget how far we have progressed on a subject and forget what it feels like to begin the learning process. Hopefully these explanations avoid this pitfall and will simplify your learning experience. Please feel free to ask questions or start a topic of discussion at &lt;a href=&#34;https://community.rstudio.com/&#34;&gt;community.rstudio.com&lt;/a&gt;, where many are happy to help!&lt;/p&gt;
&lt;p&gt;Here are some additional links to resources that you may want to check out. The first two I wrote for RStudio documentation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://rstudio.github.io/r2d3/articles/shiny.html&#34;&gt;Using r2d3 with Shiny&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;http://db.rstudio.com/best-practices/dashboards/#using-r2d3-for-interactivity-and-drill-down&#34;&gt;Enterprise-ready dashboards&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href=&#34;https://github.com/d3/d3/blob/master/API.md&#34;&gt;D3 API reference&lt;/a&gt; is really good; I use it often.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href=&#34;https://rstudio.github.io/r2d3/&#34;&gt;&lt;code&gt;r2d3&lt;/code&gt; site&lt;/a&gt; has a great Gallery and articles to review. It has a section about &lt;a href=&#34;https://rstudio.github.io/r2d3/articles/learning_d3.html&#34;&gt;learning D3&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/09/20/shiny-r2d3/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Slack and Plumber, Part One</title>
      <link>https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/</link>
      <pubDate>Thu, 30 Aug 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/</guid>
      <description>
        


&lt;p&gt;In &lt;a href=&#34;https://rviews.rstudio.com/2018/07/23/rest-apis-and-plumber/&#34;&gt;the previous post&lt;/a&gt;, we introduced &lt;a href=&#34;https://www.rplumber.io&#34;&gt;&lt;code&gt;plumber&lt;/code&gt;&lt;/a&gt; as a way to expose R processes and programs to external systems via REST API endpoints. In this post, we’ll go further by building out an API that powers a &lt;a href=&#34;https://api.slack.com/slash-commands&#34;&gt;Slack slash command&lt;/a&gt;, all from within R using &lt;code&gt;plumber&lt;/code&gt;. A subsequent post will outline deploying and securing the API.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-08-15-blair-plumber-slack-files/slash-command-preview.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;We will create an API built on top of simulated customer call data that powers a slash command. This command allows users to view a customer status report within Slack. As shown, this status report contains the customer name, total calls, date of birth, and a plot of call history for the past 20 weeks. The simulated data, along with the script used to create it, can be found in the &lt;a href=&#34;https://github.com/sol-eng/plumber-slack&#34;&gt;GitHub repository&lt;/a&gt; for this example.&lt;/p&gt;
&lt;div id=&#34;setup&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://slack.com&#34;&gt;Slack&lt;/a&gt; is a commonly used communication tool that’s highly customizable through various integrations. It’s even possible to build your own integrations, which is what we’ll be doing here. In order to build a Slack app, you need to have a Slack account and follow &lt;a href=&#34;https://api.slack.com/slack-apps&#34;&gt;the instructions for creating an app&lt;/a&gt;. In this example, we will build an app that includes a slash command that users can access by typing &lt;code&gt;/&amp;lt;command-name&amp;gt;&lt;/code&gt; into a Slack message.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-slack-request&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Slack request&lt;/h2&gt;
&lt;p&gt;In this scenario, we’re building an API that will interact with a known request. This means that we need to understand the nature of the incoming request so that we can appropriately handle it within the API. Slack provides &lt;a href=&#34;https://api.slack.com/slash-commands#app_command_handling&#34;&gt;some documentation&lt;/a&gt; about the request that is sent when a slash command is invoked. In short, an HTTP POST request is made that contains a URL-encoded data payload. An example data payload looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;token=gIkuvaNzQIHg97ATvDxqgjtO
&amp;amp;team_id=T0001
&amp;amp;team_domain=example
&amp;amp;enterprise_id=E0001
&amp;amp;enterprise_name=Globular%20Construct%20Inc
&amp;amp;channel_id=C2147483705
&amp;amp;channel_name=test
&amp;amp;user_id=U2147483697
&amp;amp;user_name=Steve
&amp;amp;command=/weather
&amp;amp;text=94070
&amp;amp;response_url=https://hooks.slack.com/commands/1234/5678
&amp;amp;trigger_id=13345224609.738474920.8088930838d88f008e0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There’s a lot of detail included in the Slack request, and the &lt;a href=&#34;https://api.slack.com/slash-commands#app_command_handling&#34;&gt;Slack documentation&lt;/a&gt; provides details about each field. We’re mainly interested in the &lt;code&gt;text&lt;/code&gt; field, which contains the text entered into Slack after the slash command. In the above example, the user entered &lt;code&gt;/weather 94070&lt;/code&gt; into Slack, so the request indicates that the command was &lt;code&gt;/weather&lt;/code&gt; and the text was &lt;code&gt;94070&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note that this approach differs from building an API that is not designed around a known request or specification. In those cases, we are free to expose endpoints and return data in whatever way seems most beneficial to downstream consumers, and we would provide those consumers with an understanding of how the API handles requests and what types of responses it generates so that they can interact with it appropriately. In this example, however, the design of our API is, in part, dictated by the specification Slack provides.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-the-api&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building the API&lt;/h2&gt;
&lt;p&gt;Now that we have an understanding of what is included in the incoming request, we can begin to build out the API using &lt;code&gt;plumber&lt;/code&gt;. First, we need to set up the global environment for the API by loading necessary packages and global objects, including the simulated data this API is built on. In reality, this data would likely come from an external database accessed via an ODBC connection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages ----
library(plumber)
library(magrittr)
library(ggplot2)

# Data ----
# Load sample customer data
sim_data &amp;lt;- readr::read_rds(&amp;quot;data/sim-data.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following diagram outlines what we want to build.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-08-15-blair-plumber-slack-files/plumber-slack-architecture.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;In essence, an incoming request passes through two &lt;a href=&#34;https://www.rplumber.io/docs/routing-and-input.html#filters&#34;&gt;filters&lt;/a&gt; before reaching an endpoint. The first filter is responsible for routing incoming requests to the correct endpoint, so that a single slash command can serve multiple endpoints without the need to create a separate command for each service. The second filter simply logs details about the request for future review.&lt;/p&gt;
&lt;div id=&#34;filters&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Filters&lt;/h3&gt;
&lt;p&gt;The first filter is responsible for parsing the incoming request and ensuring it is assigned to the appropriate endpoint. This is done because when a slash command is created in Slack, there is only one endpoint defined for requests made from the command. This filter enables several endpoints to be utilized by the same slash command by parsing the incoming &lt;code&gt;text&lt;/code&gt; of the command and treating the first value of that command as the endpoint to which the request should be routed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Parse the incoming request and route it to the appropriate endpoint
#* @filter route-endpoint
function(req, text = &amp;quot;&amp;quot;) {
  # Identify endpoint
  split_text &amp;lt;- urltools::url_decode(text) %&amp;gt;%
    strsplit(&amp;quot; &amp;quot;) %&amp;gt;%
    unlist()
  
  if (length(split_text) &amp;gt;= 1) {
    endpoint &amp;lt;- split_text[[1]]
    
    # Modify request with updated endpoint
    req$PATH_INFO &amp;lt;- paste0(&amp;quot;/&amp;quot;, endpoint)
    
    # Modify request with remaining commands from text
    req$ARGS &amp;lt;- split_text[-1] %&amp;gt;% 
      paste0(collapse = &amp;quot; &amp;quot;)
  }
  
  # Forward request 
  forward()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This filter requires an understanding of the &lt;a href=&#34;https://www.rplumber.io/docs/routing-and-input.html#the-request-object&#34;&gt;&lt;code&gt;req&lt;/code&gt; object&lt;/a&gt;. It’s important to note that a few things happen in this filter. First, we parse the &lt;code&gt;text&lt;/code&gt; argument and use the first part of &lt;code&gt;text&lt;/code&gt; as the &lt;code&gt;req$PATH_INFO&lt;/code&gt;, which tells &lt;code&gt;plumber&lt;/code&gt; where to route the request. Second, we take anything remaining from &lt;code&gt;text&lt;/code&gt; and attach it to the request in &lt;code&gt;req$ARGS&lt;/code&gt;. This means that any downstream filters or endpoints will have access to &lt;code&gt;req$ARGS&lt;/code&gt;. The second filter is taken straight from the &lt;a href=&#34;https://www.rplumber.io/docs/routing-and-input.html#forward-to-another-handler&#34;&gt;&lt;code&gt;plumber&lt;/code&gt; documentation&lt;/a&gt; and simply logs details about the incoming request.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Log information about the incoming request
#* @filter logger
function(req){
  cat(as.character(Sys.time()), &amp;quot;-&amp;quot;, 
      req$REQUEST_METHOD, req$PATH_INFO, &amp;quot;-&amp;quot;, 
      req$HTTP_USER_AGENT, &amp;quot;@&amp;quot;, req$REMOTE_ADDR, &amp;quot;\n&amp;quot;)
  
  # Forward request
  forward()
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;endpoints&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Endpoints&lt;/h3&gt;
&lt;p&gt;There are a few endpoints we need to define. First, we need to define an endpoint that provides a response Slack can understand and interpret into a message. In this case, we’re going to return a &lt;a href=&#34;https://www.json.org&#34;&gt;JSON object&lt;/a&gt; that Slack interprets into a message with attachments. Slack provides &lt;a href=&#34;https://api.slack.com/docs/message-attachments&#34;&gt;detailed documentation&lt;/a&gt; on what fields it accepts in a response. Also note that Slack expects unboxed JSON, while the &lt;a href=&#34;https://www.rplumber.io/docs/rendering-and-output.html#boxed-vs-unboxed-json&#34;&gt;&lt;code&gt;plumber&lt;/code&gt; default&lt;/a&gt; is to return boxed JSON. In order to ensure that Slack understands the response, we set the serializer for this response to be &lt;code&gt;unboxedJSON&lt;/code&gt;.&lt;/p&gt;
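&lt;p&gt;The difference is easy to see with &lt;code&gt;jsonlite&lt;/code&gt; (the serializer &lt;code&gt;plumber&lt;/code&gt; uses); length-one vectors are boxed into JSON arrays unless &lt;code&gt;auto_unbox = TRUE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(jsonlite)

toJSON(list(response_type = &amp;quot;ephemeral&amp;quot;))
# {&amp;quot;response_type&amp;quot;:[&amp;quot;ephemeral&amp;quot;]}    boxed: not what Slack expects

toJSON(list(response_type = &amp;quot;ephemeral&amp;quot;), auto_unbox = TRUE)
# {&amp;quot;response_type&amp;quot;:&amp;quot;ephemeral&amp;quot;}      unboxed: what Slack expects&lt;/code&gt;&lt;/pre&gt;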
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Return a message containing status details about the customer
#* @serializer unboxedJSON
#* @post /status
function(req, res) {
  # Check req$ARGS and match to customer - if no customer match is found, return
  # an error
  
  customer_ids &amp;lt;- unique(sim_data$id)
  customer_names &amp;lt;- unique(sim_data$name)
  
  if (!as.numeric(req$ARGS) %in% customer_ids &amp;amp; !req$ARGS %in% customer_names) {
    res$status &amp;lt;- 400
    return(
      list(
        response_type = &amp;quot;ephemeral&amp;quot;,
        text = paste(&amp;quot;Error: No customer found matching&amp;quot;, req$ARGS)
      )
    )
  }
  
  # Filter data to customer data based on provided id / name
  if (as.numeric(req$ARGS) %in% customer_ids) {
    customer_id &amp;lt;- as.numeric(req$ARGS)
    customer_data &amp;lt;- dplyr::filter(sim_data, id == customer_id)
    customer_name &amp;lt;- unique(customer_data$name)
  } else {
    customer_name &amp;lt;- req$ARGS
    customer_data &amp;lt;- dplyr::filter(sim_data, name == customer_name)
    customer_id &amp;lt;- unique(customer_data$id)
  }
  
  # Simple heuristics for customer status
  total_customer_calls &amp;lt;- sum(customer_data$calls)
  
  customer_status &amp;lt;- dplyr::case_when(total_customer_calls &amp;gt; 250 ~ &amp;quot;danger&amp;quot;,
                                      total_customer_calls &amp;gt; 130 ~ &amp;quot;warning&amp;quot;,
                                      TRUE ~ &amp;quot;good&amp;quot;)
  
  # Build response
  list(
    # response type - ephemeral indicates the response will only be seen by the
    # user who invoked the slash command as opposed to the entire channel
    response_type = &amp;quot;ephemeral&amp;quot;,
    # attachments is expected to be an array, hence the list within a list
    attachments = list(
      list(
        color = customer_status,
        title = paste0(&amp;quot;Status update for &amp;quot;, customer_name, &amp;quot; (&amp;quot;, customer_id, &amp;quot;)&amp;quot;),
        fallback = paste0(&amp;quot;Status update for &amp;quot;, customer_name, &amp;quot; (&amp;quot;, customer_id, &amp;quot;)&amp;quot;),
        # History plot
        image_url = paste0(&amp;quot;localhost:5762/plot/history/&amp;quot;, customer_id),
        # Fields provide a way of communicating semi-tabular data in Slack
        fields = list(
          list(
            title = &amp;quot;Total Calls&amp;quot;,
            value = sum(customer_data$calls),
            short = TRUE
          ),
          list(
            title = &amp;quot;DoB&amp;quot;,
            value = unique(customer_data$dob),
            short = TRUE
          )
        )
      )
    )
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are three main things that happen in this endpoint. First, we check to ensure that the provided customer name or ID appear in the dataset. Next, we create a subset of the data for only the identified customer and use a simple heuristic to determine the customer’s status. Finally, we put a list together that will be serialized into JSON in response to requests made to this endpoint. This list conforms to the standards outlined by Slack.&lt;/p&gt;
&lt;p&gt;The second endpoint is used to provide the history plot that is referenced in the first endpoint. When an &lt;code&gt;image_url&lt;/code&gt; is provided in a Slack attachment, Slack uses a GET request to fetch the image from the URL. So, this endpoint responds to incoming GET requests with an image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#* Plot customer weekly calls
#* @png
#* @param cust_id ID of the customer
#* @get /plot/history/&amp;lt;cust_id:int&amp;gt;
function(cust_id, res) {
  # Throw error if cust_id doesn&amp;#39;t exist in data
  if (!cust_id %in% sim_data$id) {
    res$status &amp;lt;- 400
    stop(&amp;quot;Customer id &amp;quot;, cust_id, &amp;quot; not found.&amp;quot;)
  }
  
  # Filter data to customer id provided
  plot_data &amp;lt;- dplyr::filter(sim_data, id == cust_id)
  
  # Customer name (id)
  customer_name &amp;lt;- paste0(unique(plot_data$name), &amp;quot; (&amp;quot;, unique(plot_data$id), &amp;quot;)&amp;quot;)
  
  # Create plot
  history_plot &amp;lt;- plot_data %&amp;gt;%
    ggplot(aes(x = time, y = calls, col = calls)) +
    ggalt::geom_lollipop(show.legend = FALSE) +
    theme_light() +
    labs(
      title = paste(&amp;quot;Weekly calls for&amp;quot;, customer_name),
      x = &amp;quot;Week&amp;quot;,
      y = &amp;quot;Calls&amp;quot;
    )
  
  # print() is necessary to render plot properly
  print(history_plot)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once these pieces are together, you can run the API either through the UI as described in the previous post, or by running &lt;code&gt;plumber::plumb(&amp;quot;plumber.R&amp;quot;)$run(port = 5762)&lt;/code&gt; from the directory containing the API.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;testing-the-api&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Testing the API&lt;/h2&gt;
&lt;p&gt;Once the API is up and running, we can test it to make sure it’s behaving as we expect. Since the main point of contact is making a POST request to the &lt;code&gt;/status&lt;/code&gt; endpoint, it’s easiest to interact with the API through &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ curl -X POST --data &amp;#39;{&amp;quot;text&amp;quot;:&amp;quot;status 1&amp;quot;}&amp;#39; localhost:5762 | jq &amp;#39;.&amp;#39;
{
  &amp;quot;response_type&amp;quot;: &amp;quot;ephemeral&amp;quot;,
  &amp;quot;attachments&amp;quot;: [
    {
      &amp;quot;color&amp;quot;: &amp;quot;good&amp;quot;,
      &amp;quot;title&amp;quot;: &amp;quot;Status update for Rahul Wilderman IV (001)&amp;quot;,
      &amp;quot;fallback&amp;quot;: &amp;quot;Status update for Rahul Wilderman IV (001)&amp;quot;,
      &amp;quot;image_url&amp;quot;: &amp;quot;localhost:5762/plot/history/001&amp;quot;,
      &amp;quot;fields&amp;quot;: [
        {
          &amp;quot;title&amp;quot;: &amp;quot;Total Calls&amp;quot;,
          &amp;quot;value&amp;quot;: 27,
          &amp;quot;short&amp;quot;: true
        },
        {
          &amp;quot;title&amp;quot;: &amp;quot;DoB&amp;quot;,
          &amp;quot;value&amp;quot;: &amp;quot;2004-04-01&amp;quot;,
          &amp;quot;short&amp;quot;: true
        }
      ]
    }
  ]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Success! Our API successfully routed our request to the appropriate endpoint and returned a valid JSON response. As a final check, we can visit the &lt;code&gt;image_url&lt;/code&gt; in our browser to see if the plot is properly rendered.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-08-15-blair-plumber-slack-files/plot-screen-shot.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Everything appears to be running as expected!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this post, we used &lt;code&gt;plumber&lt;/code&gt; to create an API that can properly interact with the Slack slash command interface. In the next post, we will explore API security and deployment. Continuing with this example, we will secure our API using &lt;a href=&#34;https://api.slack.com/docs/verifying-requests-from-slack&#34;&gt;Slack’s guidelines&lt;/a&gt;, deploy the API, and finally connect Slack so that we can use our new slash command. At the conclusion of the next post, we will have a fully functioning Slack slash command, all built using R.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;James Blair is a solutions engineer at RStudio who focusses on tools, technologies, and best practices for using R in the enterprise.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/08/30/slack-and-plumber-part-one/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Learning Analytic Administration through a Sandbox</title>
      <link>https://rviews.rstudio.com/2018/08/23/learning-analytic-administration-through-a-sandbox/</link>
      <pubDate>Thu, 23 Aug 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/08/23/learning-analytic-administration-through-a-sandbox/</guid>
      <description>
        

&lt;p&gt;It all starts with sandboxes. Development sandboxes are dedicated safe spaces for experimentation and creativity. A sandbox is a place where you can go to test and break things, without the ramifications of breaking the real, important things. If you&amp;rsquo;re an analytic administrator who doesn&amp;rsquo;t have access to a sandbox or the means to get one, I recommend advocating to change that. Here are some arguments you may find helpful for why sandboxes are a powerful tool for the R admin.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sandbox experimentation develops valuable experience and promotes exposure to best practices.&lt;/li&gt;
&lt;li&gt;Sandboxes can be used to demonstrate quick wins or establish grounds for future investments.&lt;/li&gt;
&lt;li&gt;Sandboxes can increase engagement with the IT group through communicating from a more informed position.&lt;/li&gt;
&lt;li&gt;They can be instrumental in creating installation and configuration recipes for the administration of R in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be an effective R admin, I have to learn through doing. In my case, this often means standing up small server instances through Amazon Web Services so that I can test out different configurations or architectures. I like to follow a fairly regimented crawl-walk-run strategy for acquiring R administration knowledge, but things still slip through the cracks.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-08-20-learning-analytic-administration-through-a-sandbox_files/crawl-walk.jpg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;For example, I wish I had taken time to explore the very basic &lt;strong&gt;Run As :HOME_USER:&lt;/strong&gt; configuration pattern when I was first learning the ropes of Shiny Server. This solves a very interesting problem: even with Shiny Server and RStudio Server installed on the same machine, Shiny applications developed in a user&amp;rsquo;s home directory within the RStudio IDE still need to be &amp;ldquo;deployed&amp;rdquo; to the Shiny Server directory in order to be made accessible there.&lt;/p&gt;

&lt;p&gt;The Shiny Server documentation lays out a simple and elegant way to run applications as the user in whose home directory the app exists, thus circumventing the need to deploy from one location on the server to another. While this solution may not be desirable in many situations, it has great merits as a sandbox:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The single-server infrastructure can be installed and configured in minutes.&lt;/li&gt;
&lt;li&gt;It can give you and your team a quick win if you&amp;rsquo;re looking to create a proof of concept.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;ll gain exposure to the Shiny Server documentation and learn how to make edits to the default shiny-server configuration file.&lt;/li&gt;
&lt;li&gt;You can create a recipe for installation and configuration that could potentially be reused by you or others, including IT.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I&amp;rsquo;ll go through the high-level steps it takes to implement this configuration as a sandbox server running on a single Amazon Web Services Elastic Compute Cloud (AWS EC2) instance. I&amp;rsquo;m going to assume you have very little experience with the technologies involved, but that you&amp;rsquo;re a tenacious R admin-in-training, hungry to learn and read whatever is necessary.&lt;/p&gt;

&lt;p&gt;Note: sandboxes can be created on all sorts of different servers. I&amp;rsquo;ve chosen an AWS EC2 instance because it is an easily accessible and commonly used cloud platform, but you could create a sandbox on your local machine with a Virtual Machine, using something like VirtualBox; use another cloud provider; or find a different solution entirely. If you already have a fresh sandbox server to play with, skip the first section and proceed straight to &lt;em&gt;Setting up the Sandbox&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&#34;getting-started-with-amazon-web-services-and-elastic-cloud-compute&#34;&gt;Getting Started with Amazon Web Services and Elastic Compute Cloud&lt;/h3&gt;

&lt;p&gt;There are a few things you&amp;rsquo;ll need to do to get started with AWS EC2. First, you need an AWS account. That will require some initial setup and a credit card. Once you have all of that, you&amp;rsquo;ll have access to the Amazon Web Services console. This is the view of all the web services Amazon has to offer - it can be quite overwhelming to ponder. The service we&amp;rsquo;re interested in today is Elastic Compute Cloud (EC2). If you&amp;rsquo;re looking at the &lt;em&gt;All Services&lt;/em&gt; view, it should be listed under &lt;em&gt;Compute&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;On the EC2 console page, you&amp;rsquo;ll need to do a couple of things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html&#34;&gt;Create a key pair and download it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Launch an Instance (click the blue button under “Create Instance” to go to the launch wizard)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stepping through the launch wizard, you&amp;rsquo;ll have many options. Here were my selections:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-08-20-learning-analytic-administration-through-a-sandbox_files/instance-details.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Take special note of the security group settings. I added two custom TCP rules to open port 8787 (RStudio Server&amp;rsquo;s default port) and port 3838 (Shiny Server&amp;rsquo;s default port).&lt;/p&gt;

&lt;p&gt;At this point you&amp;rsquo;re ready to launch.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Instances&lt;/em&gt; view under &lt;em&gt;Resources&lt;/em&gt; on the main EC2 console page will show you a list of all the running EC2 instances you have in this region. Once the instance you launched is listed as &lt;em&gt;running&lt;/em&gt;, you&amp;rsquo;ll want to connect to it. Click on your instance to select it in the list; the &lt;em&gt;Connect&lt;/em&gt; button should become enabled once you do.&lt;/p&gt;

&lt;p&gt;Click the Connect button and follow the steps listed there to SSH into your EC2 instance. Congrats - you now have a fresh CentOS-flavored Linux machine to learn on and configure!&lt;/p&gt;

&lt;h3 id=&#34;setting-up-the-sandbox-installation&#34;&gt;Setting up the Sandbox: Installation&lt;/h3&gt;

&lt;p&gt;Now that you have a clean sandbox, it&amp;rsquo;s time to bring in the toys.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://aws.amazon.com/premiumsupport/knowledge-center/ec2-enable-epel/&#34;&gt;Install and enable the Extra Packages for Enterprise Linux (EPEL) repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the guidelines for installing R and the shiny package library listed in the &lt;a href=&#34;https://www.rstudio.com/products/shiny/download-server/&#34;&gt;instructions for Shiny Server open source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Continue using the same instructions to download and install Shiny Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, the Shiny Server service will start running with all default configurations in place. Go back to the EC2 console and your Connection dialog pane to grab the public DNS address. Navigate to that address in a web browser, using port 3838 (e.g. &lt;a href=&#34;http://ec2-public-dns:3838&#34;&gt;http://ec2-public-dns:3838&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You should see the welcome page!&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-08-20-learning-analytic-administration-through-a-sandbox_files/welcome-shiny-server.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;The Shiny Server welcome page has two panels on the right-hand side. The top frame should feature a functional shiny application. The bottom frame is meant to show an R Markdown document, but because you haven&amp;rsquo;t yet configured the server to host those documents, it should show an error message. If hosting R Markdown documents is important to the success of your sandbox, &lt;a href=&#34;http://docs.rstudio.com/shiny-server/#r-markdown&#34;&gt;learn how to set that up&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that Shiny Server is up and running, you&amp;rsquo;ll need to go through a similar &lt;a href=&#34;https://www.rstudio.com/products/rstudio/download-server/&#34;&gt;installation process for RStudio Server&lt;/a&gt;. Remember that our plan from the beginning was to install both these services on the same machine - don&amp;rsquo;t create a new EC2 instance for RStudio Server.&lt;/p&gt;

&lt;p&gt;Once installed, open a separate web browser window or tab and navigate to the public DNS again, but this time at port 8787 (e.g. &lt;a href=&#34;http://ec2-public-dns:8787&#34;&gt;http://ec2-public-dns:8787&lt;/a&gt;). Here you should see the RStudio Server sign-in landing page. To sign in to RStudio Server you&amp;rsquo;ll need a user and password. As the sandbox administrator, it&amp;rsquo;s your job to create this first user. Take a look at the &lt;a href=&#34;https://github.com/sol-eng/data-science-lab/blob/master/redhat/01-instance-setup.Rmd&#34;&gt;RStudio Data Science Lab manual&lt;/a&gt; for instructions on how to do this. After you create a user, verify that you can sign into the RStudio Server IDE. This is where you&amp;rsquo;ll be able to build new Shiny applications.&lt;/p&gt;

&lt;h3 id=&#34;configuring-the-sandbox&#34;&gt;Configuring the Sandbox&lt;/h3&gt;

&lt;p&gt;Shiny Server and RStudio Server should now be installed and running on your machine. The installation step is usually the easy part. Configuration tends to be harder. This is the stage where you&amp;rsquo;ll start adapting the default product so that it can perform in your particular environment. Configuration changes should be made based on your goals, architecture, and ultimately the type of experience you would like the end user to have with this software. In some cases, you may end up testing and combining configuration options from very different sources in the documentation. It can be easy to lose track of what changes were made, which is why keeping notes, or making step-by-step recipes is important.&lt;/p&gt;

&lt;p&gt;Your goal is to change the default configuration of Shiny Server so that users (Shiny developers) in the RStudio IDE can save applications to a folder in their home directory, and have those applications be run as the home user and served from that home location.&lt;/p&gt;

&lt;p&gt;To accomplish this change, &lt;a href=&#34;http://docs.rstudio.com/shiny-server/#default-configuration&#34;&gt;find and edit the shiny-server.conf file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are two sections of the Shiny Server documentation that I found helpful in crafting my changes to the configuration file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://docs.rstudio.com/shiny-server/#run-as&#34;&gt;2.3.1 :HOME_USER:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://docs.rstudio.com/shiny-server/#host-per-user-application-directories&#34;&gt;2.7.3 Host Per-User Application Directories&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
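&lt;p&gt;Putting those two documentation sections together, the relevant parts of &lt;code&gt;shiny-server.conf&lt;/code&gt; end up looking something like the sketch below. Treat it as a starting point rather than a recipe; confirm the exact directives against the documentation for your installed version of Shiny Server.&lt;/p&gt;

```
# Run each application as the owner of the home directory it lives in,
# falling back to the shiny user elsewhere
run_as :HOME_USER: shiny;

server {
  listen 3838;

  location / {
    # Serve applications out of each user's ~/ShinyApps directory,
    # at URLs of the form /username/appname/
    user_dirs;
  }
}
```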

&lt;p&gt;When you finish making changes to the configuration file, &lt;a href=&#34;http://docs.rstudio.com/shiny-server/#stopping-and-starting&#34;&gt;restart the shiny-server service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Test your changes! The template &lt;em&gt;new&lt;/em&gt; Shiny application should make it easy to test your deployment configuration. My user is named &lt;em&gt;rstudio&lt;/em&gt;, and this is what the tree structure of my home directory looks like for the deployment of a Shiny application, &lt;em&gt;app1&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-08-20-learning-analytic-administration-through-a-sandbox_files/dir-tree.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;From the Shiny Server side, &lt;em&gt;app1&lt;/em&gt; is available at: &lt;a href=&#34;http://ec2-public-dns:3838/rstudio/app1/&#34;&gt;http://ec2-public-dns:3838/rstudio/app1/&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;write-a-recipe-and-retire-the-sandbox&#34;&gt;Write a Recipe and Retire the Sandbox&lt;/h3&gt;

&lt;p&gt;Remember to summarize your notes &lt;a href=&#34;https://www.shellscript.sh/first.html&#34;&gt;into scripts&lt;/a&gt; that you can reuse or just save as a reference. I like to keep my installation and configuration scripts in a version control system like git so that I have a lasting record of all the changes I make over time. Don&amp;rsquo;t worry if your script isn&amp;rsquo;t perfect right now. We will cover techniques for writing recipes to meet IT standards in a later post.&lt;/p&gt;

&lt;p&gt;The final step of this process is to shut everything down. Once I declare success, make the notes I want to keep, and share any lessons learned, it&amp;rsquo;s time to terminate. If you invest in writing out a recipe script now, it shouldn&amp;rsquo;t take much time to recreate this sandbox. There&amp;rsquo;s no reason to spend money keeping it running for longer than you need it.&lt;/p&gt;

&lt;p&gt;Use the &lt;em&gt;Actions&lt;/em&gt; button in the Instances view of the EC2 console to terminate any running instances that you&amp;rsquo;re finished using.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-08-20-learning-analytic-administration-through-a-sandbox_files/terminate-instance.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h3 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;As an analytic administrator, the job of legitimizing R and advocating for the best, cutting-edge software falls on you. This is challenging, potentially frustrating, but hopefully ultimately rewarding work. There are an infinite number of sandboxes to create and learn from; hopefully, this post will inspire you to pursue the creation and design of some of your own. Remember that sandboxes are a great tool for demonstrating the value of R as a proof-of-concept, or teaching yourself a new set of skills, but they generally aren&amp;rsquo;t meant to be taken into production.&lt;/p&gt;

&lt;p&gt;For more information on running Shiny in production in an enterprise environment, I would recommend starting with an evaluation of &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt; and &lt;a href=&#34;https://www.rstudio.com/products/rstudio-server-pro/&#34;&gt;RStudio Server Pro&lt;/a&gt;. There will also be a &lt;a href=&#34;http://www.cvent.com/events/rstudio-conf-austin/event-summary-dd6d75526f3c4554b67c4de32aeffb47.aspx&#34;&gt;workshop at RStudio conf 2019&lt;/a&gt; called &lt;em&gt;Shiny in Production | Data Products at Scale&lt;/em&gt; taught by me and my colleague, Sean Lopp, which may be of interest to you. In the meantime, if you build a cool sandbox or learn something worth sharing, we hope you&amp;rsquo;ll post about it on the &lt;a href=&#34;https://community.rstudio.com/c/r-admin&#34;&gt;RStudio community forum for R admins&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kelly O&amp;rsquo;Briant is a solutions engineer at RStudio interested in configuration and workflow management with a passion for R administration.&lt;/em&gt;&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/08/23/learning-analytic-administration-through-a-sandbox/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>REST APIs and Plumber</title>
      <link>https://rviews.rstudio.com/2018/07/23/rest-apis-and-plumber/</link>
      <pubDate>Mon, 23 Jul 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/07/23/rest-apis-and-plumber/</guid>
      <description>
        


&lt;p&gt;Moving R resources from development to production can be a challenge, especially when the resource isn’t something like a &lt;a href=&#34;http://shiny.rstudio.com&#34;&gt;&lt;code&gt;shiny&lt;/code&gt; application&lt;/a&gt; or &lt;a href=&#34;https://rmarkdown.rstudio.com&#34;&gt;&lt;code&gt;rmarkdown&lt;/code&gt; document&lt;/a&gt; that can be easily published and consumed. Consider, as an example, a customer success model created in R. This model is responsible for taking customer data and returning a predicted outcome, like the likelihood the customer will churn. Once this model is developed and validated, there needs to be some way for the model output to be leveraged by other systems and individuals within the company.&lt;/p&gt;
&lt;p&gt;Traditionally, moving this model into production has involved one of two approaches: either running customer data through the model on a batch basis and caching the results in a database, or handing the model definition off to a development team to translate the work done in R into another language, such as Java or Scala. Both approaches have significant downsides. Batch processing works, but it misses real-time updates. For example, if the batch job runs every night and a customer calls in the next morning and has a heated conversation with support, the model output will have no record of that exchange when the customer calls the customer loyalty department later the same day to cancel their service. In essence, model output is served on a lag, which can sometimes lead to critical information loss. However, the other option requires a large investment of time and resources to convert an existing model into another language just for the purpose of exposing that model as a real-time service. Neither of these approaches is ideal; to solve this problem, the optimal solution is to expose the existing R model as a service that can be easily accessed by other parts of the organization.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.rplumber.io&#34;&gt;&lt;code&gt;plumber&lt;/code&gt;&lt;/a&gt; is an R package that allows existing R code to be exposed as a web service through special decorator comments. With minimal overhead, R programmers and analysts can use &lt;code&gt;plumber&lt;/code&gt; to create REST APIs that expose their work to any number of internal and external systems. This solution provides real-time access to processes and services created entirely in R, and can effectively eliminate the need to perform batch operations or technical hand-offs in order to move R code into production.&lt;/p&gt;
&lt;p&gt;This post will focus on a brief introduction to RESTful APIs, then an introduction to the &lt;code&gt;plumber&lt;/code&gt; package and how it can be used to expose R services as API endpoints. In subsequent posts, we’ll build a functioning web API using &lt;code&gt;plumber&lt;/code&gt; that integrates with &lt;a href=&#34;https://slack.com&#34;&gt;Slack&lt;/a&gt; and provides real-time customer status reports.&lt;/p&gt;
&lt;div id=&#34;web-apis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Web APIs&lt;/h2&gt;
&lt;p&gt;For some, &lt;a href=&#34;https://en.wikipedia.org/wiki/Application_programming_interface&#34;&gt;APIs (application programming interfaces)&lt;/a&gt; are things heard of but seldom seen. However, whether seen or unseen, APIs are part of everyday digital life. In fact, you’ve likely used a web API from within R, even if you didn’t recognize it at the time! Several R packages are simply wrappers around popular web APIs, such as &lt;a href=&#34;https://walkerke.github.io/tidycensus/&#34;&gt;&lt;code&gt;tidycensus&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/r-lib/gh&#34;&gt;&lt;code&gt;gh&lt;/code&gt;&lt;/a&gt;. Web APIs are a common framework for sharing information across a network, most commonly through &lt;a href=&#34;https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol&#34;&gt;HTTP&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;http&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;HTTP&lt;/h3&gt;
&lt;p&gt;To understand how HTTP requests work, it’s helpful to know the players involved. A &lt;em&gt;client&lt;/em&gt; makes a request to a &lt;em&gt;server&lt;/em&gt;, which interprets the request and provides a response. An HTTP request can be thought of simply as a packet of information sent to the server, which the server attempts to interpret and respond to. Every time you visit a URL in a web browser, an HTTP request is made and the response is rendered by the browser as the website you see. It is possible to inspect this interaction using the development tools in a browser.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-07-17-blair-plumber-intro-files/devtools-screenshot-request.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;As seen above, this request is composed of a URL and a request method, which, in the case of a web browser accessing a website, is GET.&lt;/p&gt;
&lt;div id=&#34;request&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Request&lt;/h4&gt;
&lt;p&gt;There are several components of an &lt;a href=&#34;https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html&#34;&gt;HTTP request&lt;/a&gt;, but here we’ll mention only a few.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;URL: the address or endpoint for the request&lt;/li&gt;
&lt;li&gt;Verb / method: a specific method invoked on the endpoint (GET, POST, DELETE, PUT)&lt;/li&gt;
&lt;li&gt;Headers: additional data sent to the server, such as who is making the request and what type of response is expected&lt;/li&gt;
&lt;li&gt;Body: data sent to the server outside of the headers, common for POST and PUT requests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the browser example above, a GET request was made by the web browser to www.rstudio.com.&lt;/p&gt;
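&lt;p&gt;These components map directly onto the request objects that HTTP client libraries expose. As a quick illustration (sketched here with Python’s standard library, simply because it makes the anatomy easy to inspect), we can assemble a request and examine its parts without ever sending it:&lt;/p&gt;

```python
from urllib.request import Request

# Assemble (but do not send) a GET request to illustrate its components.
req = Request(
    "http://httpbin.org/get",                # URL: the endpoint for the request
    headers={"Accept": "application/json"},  # header: the response type we expect
    method="GET",                            # verb / method
)

print(req.full_url)      # the endpoint
print(req.get_method())  # the verb
print(req.headers)       # the headers
```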
&lt;/div&gt;
&lt;div id=&#34;response&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Response&lt;/h4&gt;
&lt;p&gt;The API response mirrors the request to some extent. It includes headers that contain information about the response and a body that contains any data returned by the API. The headers include the HTTP status code that informs the client how the request was received, along with details about the content that’s being delivered. In the example of a web browser accessing www.rstudio.com, we can see below that the response headers include the status code (200) along with details about the response content, including the fact that the content returned is HTML. This HTML content is what the browser renders into a webpage.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-07-17-blair-plumber-intro-files/devtools-screenshot-response.png&#34; /&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;httr&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;httr&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;http://httr.r-lib.org/index.html&#34;&gt;&lt;code&gt;httr&lt;/code&gt;&lt;/a&gt; package provides a nice framework for working with HTTP requests in R. The following basic example demonstrates some of what we’ve already learned by using &lt;code&gt;httr&lt;/code&gt; and &lt;a href=&#34;http://httpbin.org/&#34;&gt;httpbin.org&lt;/a&gt;, which provides a playground of sorts for HTTP requests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(httr)
# A simple GET request
response &amp;lt;- GET(&amp;quot;http://httpbin.org/get&amp;quot;)
response&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Response [http://httpbin.org/get]
##   Date: 2018-07-23 14:57
##   Status: 200
##   Content-Type: application/json
##   Size: 266 B
## {&amp;quot;args&amp;quot;:{},&amp;quot;headers&amp;quot;:{&amp;quot;Accept&amp;quot;:&amp;quot;application/json, text/xml, application/...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example we’ve made a GET request to httpbin.org/get and received a response. We know our request was successful because we see that the status is 200. We also see that the response contains data in JSON format. The &lt;a href=&#34;http://httr.r-lib.org/articles/quickstart.html&#34;&gt;&lt;em&gt;Getting started with httr&lt;/em&gt;&lt;/a&gt; page provides additional examples of working with HTTP requests and responses.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rest&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;REST&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Representational_state_transfer&#34;&gt;Representational State Transfer (REST)&lt;/a&gt; is an architectural style for APIs that includes specific constraints for building APIs to ensure that they are consistent, performant, and scalable. In order to be considered truly RESTful, an API must meet the following six constraints (the last of which is optional):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uniform interface: clearly defined interface between client and server&lt;/li&gt;
&lt;li&gt;Stateless: state is managed via the requests themselves, not through reliance on an external service&lt;/li&gt;
&lt;li&gt;Cacheable: responses should be cacheable in order to improve scalability&lt;/li&gt;
&lt;li&gt;Client-Server: clear separation of client and server, each with its own distinct responsibilities in the exchange&lt;/li&gt;
&lt;li&gt;Layered System: there may be intermediaries between the client and the server, but the client should be unaware of them&lt;/li&gt;
&lt;li&gt;Code on Demand (optional): the response can include logic executable by the client&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We could spend a lot of time diving further into each of these specifications, but that is beyond the scope of this post. More detail about REST can be found &lt;a href=&#34;https://www.restapitutorial.com&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;plumber&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plumber&lt;/h2&gt;
&lt;p&gt;Creating RESTful APIs using R is straightforward using the &lt;code&gt;plumber&lt;/code&gt; package. Even if you have never written an API, &lt;code&gt;plumber&lt;/code&gt; makes it easy to turn existing R functions into API endpoints. Developing &lt;code&gt;plumber&lt;/code&gt; endpoints is simply a matter of providing specialized R comments before R functions. &lt;code&gt;plumber&lt;/code&gt; recognizes both &lt;code&gt;#&#39;&lt;/code&gt; and &lt;code&gt;#*&lt;/code&gt; comments, although the latter is recommended in order to avoid potential conflicts with &lt;a href=&#34;https://github.com/yihui/roxygen2&#34;&gt;&lt;code&gt;roxygen2&lt;/code&gt;&lt;/a&gt;. The following defines a &lt;code&gt;plumber&lt;/code&gt; endpoint that simply returns the data provided in the request query string.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plumber)

#* @apiTitle Simple API

#* Echo provided text
#* @param text The text to be echoed in the response
#* @get /echo
function(text = &amp;quot;&amp;quot;) {
  list(
    message_echo = paste(&amp;quot;The text is:&amp;quot;, text)
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we’ve defined a simple function that takes a parameter, &lt;code&gt;text&lt;/code&gt;, and returns it, embedded in a short message, as part of a list. By default, &lt;code&gt;plumber&lt;/code&gt; will serialize the object returned from a function into JSON using the &lt;a href=&#34;https://github.com/jeroen/jsonlite&#34;&gt;&lt;code&gt;jsonlite&lt;/code&gt;&lt;/a&gt; package. We’ve provided specialized comments to inform &lt;code&gt;plumber&lt;/code&gt; that this endpoint is available at &lt;code&gt;api-url/echo&lt;/code&gt; and will respond to GET requests.&lt;/p&gt;
&lt;p&gt;There are a few ways this &lt;code&gt;plumber&lt;/code&gt; script can be run locally. First, assuming the file is saved as &lt;code&gt;plumber.R&lt;/code&gt;, the following code would start a local web server hosting the API.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plumber::plumb(&amp;quot;plumber.R&amp;quot;)$run(port = 5762)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the web server has started, the API can be interacted with using any set of HTTP tools. We could even interact with it using &lt;code&gt;httr&lt;/code&gt; as demonstrated earlier, although we would need to open a separate R session to do so since the current R session is busy serving the API.&lt;/p&gt;
&lt;p&gt;The other method for running the API requires a recent &lt;a href=&#34;https://www.rstudio.com/products/rstudio/download/preview/&#34;&gt;preview build&lt;/a&gt; of the RStudio IDE, which includes features that make it easier to work with &lt;code&gt;plumber&lt;/code&gt;. When editing a &lt;code&gt;plumber&lt;/code&gt; script in one of these builds, a “Run API” icon will appear in the top right-hand corner of the source editor. Clicking this button will automatically run a line of code similar to the one we ran above to start a web server hosting the API. A &lt;a href=&#34;https://swagger.io&#34;&gt;Swagger&lt;/a&gt;-generated UI will be rendered in the Viewer pane, and the API can be interacted with directly from within this UI.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-07-17-blair-plumber-intro-files/swagger-screenshot.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Now that we have a running &lt;code&gt;plumber&lt;/code&gt; API, we can query it using &lt;code&gt;curl&lt;/code&gt; from the command line to investigate its behavior.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ curl &amp;quot;localhost:5762/echo&amp;quot; | jq &amp;#39;.&amp;#39;
{
  &amp;quot;message_echo&amp;quot;: [
    &amp;quot;The text is: &amp;quot;
  ]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, we queried the API without providing any additional data or parameters. As a result, the &lt;code&gt;text&lt;/code&gt; parameter is the default empty string, as seen in the response. In order to pass a value to our underlying function, we can define a query string in the request as follows:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ curl &amp;quot;localhost:5762/echo?text=Hi%20there&amp;quot; | jq &amp;#39;.&amp;#39;
{
  &amp;quot;message_echo&amp;quot;: [
    &amp;quot;The text is: Hi there&amp;quot;
  ]
}&lt;/code&gt;&lt;/pre&gt;
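&lt;p&gt;Note the &lt;code&gt;%20&lt;/code&gt; in the URL above: a space is not a legal URL character, so it must be percent-encoded. You rarely need to do this encoding by hand; any HTTP tooling can build an encoded query string for you, as in this short sketch (Python’s standard library, for illustration):&lt;/p&gt;

```python
from urllib.parse import quote, urlencode

# Percent-encode a single value for use in a query string.
print(quote("Hi there"))  # spaces become %20

# Build a full query string from key-value pairs, joined by the usual
# separator; quote_via=quote keeps %20 rather than the form-encoding
# default of plus signs.
print(urlencode({"text": "Hi there", "number": 42}, quote_via=quote))
```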
&lt;p&gt;In this case, the &lt;code&gt;text&lt;/code&gt; parameter is defined as part of the query string, which is appended to the end of the URL. Additional parameters could be defined by separating each key-value pair with &lt;code&gt;&amp;amp;&lt;/code&gt;. It’s also possible to pass the parameter as part of the request body. However, to leverage this method of data delivery, we need to update our API definition so that the &lt;code&gt;/echo&lt;/code&gt; endpoint also accepts POST requests. We’ll also update our API to consider multiple parameters, and return the parsed parameters along with the entire request body.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plumber)

#* @apiTitle Simple API

#* Echo provided text
#* @param text The text to be echoed in the response
#* @param number A number to be echoed in the response
#* @get /echo
#* @post /echo
function(req, text = &amp;quot;&amp;quot;, number = 0) {
  list(
    message_echo = paste(&amp;quot;The text is:&amp;quot;, text),
    number_echo = paste(&amp;quot;The number is:&amp;quot;, number),
    raw_body = req$postBody
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this new API definition, the following &lt;code&gt;curl&lt;/code&gt; request can be made to pass parameters to the API via the request body.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ curl --data &amp;quot;text=Hi%20there&amp;amp;number=42&amp;amp;other_param=something%20else&amp;quot; &amp;quot;localhost:5762/echo&amp;quot; | jq &amp;#39;.&amp;#39;
{
  &amp;quot;message_echo&amp;quot;: [
    &amp;quot;The text is: Hi there&amp;quot;
  ],
  &amp;quot;number_echo&amp;quot;: [
    &amp;quot;The number is: 42&amp;quot;
  ],
  &amp;quot;raw_body&amp;quot;: [
    &amp;quot;text=Hi%20there&amp;amp;number=42&amp;amp;other_param=something%20else&amp;quot;
  ]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we passed more than just &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;number&lt;/code&gt; in the request body. &lt;code&gt;plumber&lt;/code&gt; parses the request body and matches any arguments found in the R function definition. Additional arguments, like &lt;code&gt;other_param&lt;/code&gt; in this case, are ignored. &lt;code&gt;plumber&lt;/code&gt; can parse the request body if it is URL-encoded or JSON. The following example shows the same request, but with the request body encoded as JSON.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;$ curl --data &amp;#39;{&amp;quot;text&amp;quot;:&amp;quot;Hi there&amp;quot;, &amp;quot;number&amp;quot;:&amp;quot;42&amp;quot;, &amp;quot;other_param&amp;quot;:&amp;quot;something else&amp;quot;}&amp;#39; &amp;quot;localhost:5762/echo&amp;quot; | jq &amp;#39;.&amp;#39;
{
  &amp;quot;message_echo&amp;quot;: [
    &amp;quot;The text is: Hi there&amp;quot;
  ],
  &amp;quot;number_echo&amp;quot;: [
    &amp;quot;The number is: 42&amp;quot;
  ],
  &amp;quot;raw_body&amp;quot;: [
    &amp;quot;{\&amp;quot;text\&amp;quot;:\&amp;quot;Hi there\&amp;quot;, \&amp;quot;number\&amp;quot;:\&amp;quot;42\&amp;quot;, \&amp;quot;other_param\&amp;quot;:\&amp;quot;something else\&amp;quot;}&amp;quot;
  ]
}&lt;/code&gt;&lt;/pre&gt;
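&lt;p&gt;Since the updated endpoint still accepts GET requests, the same parameters can also be supplied in a query string, with key-value pairs joined by &lt;code&gt;&amp;amp;&lt;/code&gt;. As a small sketch, the percent-encoded query string can be built in R with the base &lt;code&gt;URLencode()&lt;/code&gt; function (the host and port are the ones assumed throughout this post):&lt;/p&gt;

```r
# Build a percent-encoded query string with multiple key-value pairs
url <- paste0(
  "http://localhost:5762/echo?",
  "text=", URLencode("Hi there", reserved = TRUE),
  "&number=", 42
)
url
# [1] "http://localhost:5762/echo?text=Hi%20there&number=42"
```

Requesting this URL with `curl` or a browser exercises the same endpoint as the POST examples above.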
&lt;p&gt;While these examples are fairly simple, they demonstrate the flexibility of &lt;code&gt;plumber&lt;/code&gt;. Thanks to &lt;code&gt;plumber&lt;/code&gt;, it is now straightforward to expose R functions so they can be consumed and leveraged by any number of systems and processes. We’ve only scratched the surface of its capabilities and, as mentioned, future posts will walk through the creation of a Slack app using &lt;code&gt;plumber&lt;/code&gt;. Comprehensive documentation for &lt;code&gt;plumber&lt;/code&gt; can be found &lt;a href=&#34;https://www.rplumber.io/docs/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;deploying&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Deploying&lt;/h2&gt;
&lt;p&gt;Up until now, we’ve just been interacting with our APIs in our local development environment. That’s great for development and testing, but when it comes time to expose an API to external services, we don’t want our laptop held responsible (at least, I don’t!). There are several &lt;a href=&#34;https://www.rplumber.io/docs/hosting.html&#34;&gt;deployment methods&lt;/a&gt; for &lt;code&gt;plumber&lt;/code&gt; outlined in the documentation. The most straightforward method of deployment is to use &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;. When editing a &lt;code&gt;plumber&lt;/code&gt; script in recent versions of the RStudio IDE, a blue publish button will appear in the top right-hand corner of the source editor. Clicking this button brings up a menu that enables the user to publish the API to an instance of RStudio Connect. Once published, API access and performance can be configured through RStudio Connect and the API can be leveraged by external systems and processes.&lt;/p&gt;
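&lt;p&gt;For self-hosted deployments, the same API definition can also be served from any R process; a minimal sketch, assuming the script above is saved as &lt;code&gt;plumber.R&lt;/code&gt; (the file name and port are assumptions for illustration):&lt;/p&gt;

```r
library(plumber)

# Load the API definition and serve it; this call blocks while the API runs
# ("plumber.R" and port 5762 are assumptions for illustration)
api <- plumb("plumber.R")
api$run(host = "0.0.0.0", port = 5762)
```

Wrapping this in a process supervisor (or a Docker container) is the usual next step for production use outside of RStudio Connect.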
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Web APIs are a powerful mechanism for providing systematic access to computational processes. Writing APIs with &lt;code&gt;plumber&lt;/code&gt; makes it easy for others to take advantage of the work you’ve created in R without the need to rely on batch processing or code rewriting. &lt;code&gt;plumber&lt;/code&gt; is exceptionally flexible and can be used to define a wide variety of endpoints. These endpoints can be used to integrate R with other systems. As an added bonus, downstream consumers of these APIs require no knowledge of R. They only need to know how to properly interact with the API via HTTP. &lt;code&gt;plumber&lt;/code&gt; provides a convenient and reliable bridge between R and other systems and/or languages used within an organization.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/07/23/rest-apis-and-plumber/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Reading and analysing log files in the RRD database format</title>
      <link>https://rviews.rstudio.com/2018/06/20/reading-rrd-files/</link>
      <pubDate>Wed, 20 Jun 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/06/20/reading-rrd-files/</guid>
      <description>
        

&lt;p&gt;I have frequent conversations with R champions and Systems Administrators responsible for R, in which they ask how they can measure and analyze the usage of their servers.  Among the many solutions to this problem, one of my favourites is to use an &lt;strong&gt;RRD&lt;/strong&gt; database and &lt;strong&gt;RRDtool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From &lt;a href=&#34;https://en.wikipedia.org/wiki/RRDtool&#34;&gt;Wikipedia&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;RRDtool&lt;/strong&gt; (&lt;em&gt;round-robin database tool&lt;/em&gt;) aims to handle time series data such as network bandwidth, temperatures or CPU load. The data is stored in a &lt;a href=&#34;https://en.wikipedia.org/wiki/Circular_buffer&#34;&gt;circular buffer&lt;/a&gt; based &lt;a href=&#34;https://en.wikipedia.org/wiki/Database&#34;&gt;database&lt;/a&gt;, thus the system storage footprint remains constant over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&#34;https://oss.oetiker.ch/rrdtool/index.en.html&#34;&gt;RRDtool&lt;/a&gt; is a library written in C, with command-line tools that can also be used directly on Linux. This makes it convenient for systems work, but difficult for R users to extract and analyze the data.&lt;/p&gt;

&lt;p&gt;I am pleased to announce that I&amp;rsquo;ve been working on the &lt;code&gt;rrd&lt;/code&gt; &lt;a href=&#34;https://github.com/andrie/rrd&#34;&gt;R package&lt;/a&gt; to import RRD files directly into &lt;code&gt;tibble&lt;/code&gt; objects, thus making it easy to analyze your metrics.&lt;/p&gt;

&lt;p&gt;As an aside, the RStudio Pro products (specifically &lt;a href=&#34;https://www.rstudio.com/products/rstudio-server-pro/&#34;&gt;RStudio Server Pro&lt;/a&gt; and &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;) also make use of RRD to store metrics &amp;ndash; more about this later.&lt;/p&gt;

&lt;h2 id=&#34;understanding-the-rrd-format-as-an-r-user&#34;&gt;Understanding the RRD format as an R user&lt;/h2&gt;

&lt;p&gt;The name RRD is an initialism of &lt;strong&gt;R&lt;/strong&gt;ound &lt;strong&gt;R&lt;/strong&gt;obin &lt;strong&gt;D&lt;/strong&gt;atabase.  The &amp;ldquo;round robin&amp;rdquo; refers to the fact that the database is always fixed in size, and as a new entry enters the database, the oldest entry is discarded. In practical terms, the database collects data for a fixed period of time, and information that is older than the threshold gets removed.&lt;/p&gt;
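&lt;p&gt;The round-robin principle can be illustrated with a toy circular buffer in R (a sketch of the idea only, not how RRD stores data internally):&lt;/p&gt;

```r
# A fixed-size buffer of 5 slots; new values overwrite the oldest entry
buf <- rep(NA_real_, 5)
for (i in 1:8) {
  buf[(i - 1) %% 5 + 1] <- i
}
buf
# [1] 6 7 8 4 5   (only the 5 most recent values survive)
```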

&lt;p&gt;&lt;img src=&#34;/post/2018-06-21-analysing-rrd-files_files/rra.png&#34; alt=&#34;Image from loriotpro(https://bit.ly/2tk2MFa)&#34; /&gt;&lt;/p&gt;

&lt;p&gt;A second quality of RRD databases is that data is stored as &amp;ldquo;consolidated data points&amp;rdquo;, where every data point is an aggregation over time. For example, a data point can represent the average value for the time period, or the maximum over the period. Typical consolidation functions include &lt;code&gt;average&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;
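&lt;p&gt;Consolidation can be mimicked in plain R; a toy sketch using simulated per-minute readings (not an actual RRD operation):&lt;/p&gt;

```r
set.seed(42)
# One hour of simulated per-minute CPU readings
per_minute <- runif(60)

# Consolidate into twelve 5-minute data points with different functions
groups <- rep(1:12, each = 5)
avg_5min <- tapply(per_minute, groups, mean)
max_5min <- tapply(per_minute, groups, max)
length(avg_5min)
# [1] 12
```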

&lt;p&gt;The third quality is that every RRD database file typically consists of multiple archives. Each archive measures data for a different time period. For instance, the archives can capture data for intervals of 10 seconds, 30 seconds, 1 minute or 5 minutes.&lt;/p&gt;

&lt;p&gt;As an example, here is a description of an RRD file that originated in RStudio Connect:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;describe_rrd(&amp;quot;rrd_cpu_0&amp;quot;)
#&amp;gt; A RRD file with 10 RRA arrays and step size 60
#&amp;gt; [1] AVERAGE_60 (43200 rows)
#&amp;gt; [2] AVERAGE_300 (25920 rows)
#&amp;gt; [3] MIN_300 (25920 rows)
#&amp;gt; [4] MAX_300 (25920 rows)
#&amp;gt; [5] AVERAGE_3600 (8760 rows)
#&amp;gt; [6] MIN_3600 (8760 rows)
#&amp;gt; [7] MAX_3600 (8760 rows)
#&amp;gt; [8] AVERAGE_86400 (1825 rows)
#&amp;gt; [9] MIN_86400 (1825 rows)
#&amp;gt; [10] MAX_86400 (1825 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This &lt;code&gt;RRD&lt;/code&gt; file contains data for the properties of CPU 0 of the system.  In this example, the first &lt;code&gt;RRA&lt;/code&gt; archive contains averaged metrics for one minute (60s) intervals, while the second &lt;code&gt;RRA&lt;/code&gt; measures the same metric, but averaged over five minutes.  The same metrics are also available for intervals of one hour and one day.&lt;/p&gt;

&lt;p&gt;Notice also that every archive has a different number of rows, representing a different historical period where the data is kept.  For example, the &lt;em&gt;per minute&lt;/em&gt; data &lt;code&gt;AVERAGE_60&lt;/code&gt; is retained for 43,200 periods (30 days) while the &lt;em&gt;daily&lt;/em&gt; data &lt;code&gt;MAX_86400&lt;/code&gt; is retained for 1,825 periods (5 years).&lt;/p&gt;
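&lt;p&gt;The retention period of each archive is simply its number of rows multiplied by its step size:&lt;/p&gt;

```r
# rows * step (seconds) / seconds-per-day = retention in days
43200 * 60    / 86400  # AVERAGE_60:   30 days
25920 * 300   / 86400  # *_300:        90 days
8760  * 3600  / 86400  # *_3600:      365 days (1 year)
1825  * 86400 / 86400  # *_86400:    1825 days (5 years)
```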

&lt;p&gt;If you want to know more, please read the excellent &lt;a href=&#34;https://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html&#34;&gt;introductory tutorial&lt;/a&gt; to RRD databases.&lt;/p&gt;

&lt;h2 id=&#34;introducing-the-rrd-package&#34;&gt;Introducing the &lt;code&gt;rrd&lt;/code&gt; package&lt;/h2&gt;

&lt;p&gt;Until recently, it wasn&amp;rsquo;t easy to import RRD files into R.  But I was pleased to discover that a &lt;a href=&#34;https://www.google-melange.com/archive/gsoc/2014&#34;&gt;Google Summer of Code 2014&lt;/a&gt; project created a proof-of-concept R package to read these files.  The author of this package is &lt;a href=&#34;http://plamendimitrov.net/&#34;&gt;Plamen Dimitrov&lt;/a&gt;, who published the code on &lt;a href=&#34;https://github.com/pldimitrov/Rrd&#34;&gt;GitHub&lt;/a&gt; and also wrote an &lt;a href=&#34;http://plamendimitrov.net/blog/2014/08/09/r-package-for-working-with-rrd-files/&#34;&gt;explanatory blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because I needed to provide suggestions to our customers, I decided to update the package, add some example code, and generally improve its reliability.&lt;/p&gt;

&lt;p&gt;The result is not yet on CRAN, but you can install the development version of the package from &lt;a href=&#34;https://github.com/andrie/rrd&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&#34;installing-the-package&#34;&gt;Installing the package&lt;/h3&gt;

&lt;p&gt;To build the package from source, you first need to install &lt;a href=&#34;http://oss.oetiker.ch/rrdtool/doc/librrd.en.html&#34;&gt;librrd&lt;/a&gt;. Installing &lt;a href=&#34;http://oss.oetiker.ch/rrdtool/&#34;&gt;RRDtool&lt;/a&gt; from your Linux package manager will usually also install this library.&lt;/p&gt;

&lt;p&gt;Using Ubuntu:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;sudo apt-get install rrdtool librrd-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Using RHEL / CentOS:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;sudo yum install rrdtool rrdtool-devel
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you have the system requirements in place, you can install the development version of the R package from &lt;a href=&#34;https://github.com/andrie/rrd&#34;&gt;GitHub&lt;/a&gt; using:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# install.packages(&amp;quot;devtools&amp;quot;)
devtools::install_github(&amp;quot;andrie/rrd&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;limitations&#34;&gt;Limitations&lt;/h3&gt;

&lt;p&gt;The package is not yet available for Windows.&lt;/p&gt;

&lt;h3 id=&#34;using-the-package&#34;&gt;Using the package&lt;/h3&gt;

&lt;p&gt;Once you&amp;rsquo;ve installed the package, you can start to use it. The package itself contains some built-in RRD files, so you should be able to run the following code directly.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(rrd)
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;describing-the-contents-of-a-rrd&#34;&gt;Describing the contents of an RRD&lt;/h4&gt;

&lt;p&gt;To describe the contents of an RRD file, use &lt;code&gt;describe_rrd()&lt;/code&gt;. This function reports the name of each archive (RRA) in the file, its consolidation function, and its number of observations:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;rrd_cpu_0 &amp;lt;- system.file(&amp;quot;extdata/cpu-0.rrd&amp;quot;, package = &amp;quot;rrd&amp;quot;)

describe_rrd(rrd_cpu_0)
#&amp;gt; A RRD file with 10 RRA arrays and step size 60
#&amp;gt; [1] AVERAGE_60 (43200 rows)
#&amp;gt; [2] AVERAGE_300 (25920 rows)
#&amp;gt; [3] MIN_300 (25920 rows)
#&amp;gt; [4] MAX_300 (25920 rows)
#&amp;gt; [5] AVERAGE_3600 (8760 rows)
#&amp;gt; [6] MIN_3600 (8760 rows)
#&amp;gt; [7] MAX_3600 (8760 rows)
#&amp;gt; [8] AVERAGE_86400 (1825 rows)
#&amp;gt; [9] MIN_86400 (1825 rows)
#&amp;gt; [10] MAX_86400 (1825 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;reading-an-entire-rrd-file&#34;&gt;Reading an entire RRD file&lt;/h4&gt;

&lt;p&gt;To read an entire RRD file, i.e. all of the RRA archives, use &lt;code&gt;read_rrd()&lt;/code&gt;. This returns a list of &lt;code&gt;tibble&lt;/code&gt; objects:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;cpu &amp;lt;- read_rrd(rrd_cpu_0)

str(cpu, max.level = 1)
#&amp;gt; List of 10
#&amp;gt;  $ AVERAGE60   :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    43199 obs. of  9 variables:
#&amp;gt;  $ AVERAGE300  :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    25919 obs. of  9 variables:
#&amp;gt;  $ MIN300      :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    25919 obs. of  9 variables:
#&amp;gt;  $ MAX300      :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    25919 obs. of  9 variables:
#&amp;gt;  $ AVERAGE3600 :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    8759 obs. of  9 variables:
#&amp;gt;  $ MIN3600     :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    8759 obs. of  9 variables:
#&amp;gt;  $ MAX3600     :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    8759 obs. of  9 variables:
#&amp;gt;  $ AVERAGE86400:Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    1824 obs. of  9 variables:
#&amp;gt;  $ MIN86400    :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    1824 obs. of  9 variables:
#&amp;gt;  $ MAX86400    :Classes &#39;tbl_df&#39;, &#39;tbl&#39; and &#39;data.frame&#39;:    1824 obs. of  9 variables:
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since the resulting object is a list of &lt;code&gt;tibble&lt;/code&gt; objects, you can easily use R functions to work with an individual archive:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;names(cpu)
#&amp;gt;  [1] &amp;quot;AVERAGE60&amp;quot;    &amp;quot;AVERAGE300&amp;quot;   &amp;quot;MIN300&amp;quot;       &amp;quot;MAX300&amp;quot;      
#&amp;gt;  [5] &amp;quot;AVERAGE3600&amp;quot;  &amp;quot;MIN3600&amp;quot;      &amp;quot;MAX3600&amp;quot;      &amp;quot;AVERAGE86400&amp;quot;
#&amp;gt;  [9] &amp;quot;MIN86400&amp;quot;     &amp;quot;MAX86400&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To inspect the contents of the first archive (&lt;code&gt;AVERAGE60&lt;/code&gt;), simply print the object; since it&amp;rsquo;s a &lt;code&gt;tibble&lt;/code&gt;, you get the first 10 rows of output.&lt;/p&gt;

&lt;p&gt;For example, the CPU data contains a time stamp and metrics for average &lt;em&gt;user&lt;/em&gt; and &lt;em&gt;sys&lt;/em&gt; usage, as well as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Nice_(Unix)&#34;&gt;&lt;em&gt;nice&lt;/em&gt;&lt;/a&gt; value, &lt;em&gt;idle&lt;/em&gt; time, &lt;a href=&#34;https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)&#34;&gt;&lt;em&gt;interrupt requests&lt;/em&gt;&lt;/a&gt; and &lt;em&gt;soft interrupt requests&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;cpu[[1]]
#&amp;gt; # A tibble: 43,199 x 9
#&amp;gt;    timestamp              user     sys  nice  idle  wait   irq softirq
#&amp;gt;  * &amp;lt;dttm&amp;gt;                &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
#&amp;gt;  1 2018-04-02 12:24:00 0.0104  0.00811     0 0.981     0     0       0
#&amp;gt;  2 2018-04-02 12:25:00 0.0126  0.00630     0 0.979     0     0       0
#&amp;gt;  3 2018-04-02 12:26:00 0.0159  0.00808     0 0.976     0     0       0
#&amp;gt;  4 2018-04-02 12:27:00 0.00853 0.00647     0 0.985     0     0       0
#&amp;gt;  5 2018-04-02 12:28:00 0.0122  0.00999     0 0.978     0     0       0
#&amp;gt;  6 2018-04-02 12:29:00 0.0106  0.00604     0 0.983     0     0       0
#&amp;gt;  7 2018-04-02 12:30:00 0.0147  0.00427     0 0.981     0     0       0
#&amp;gt;  8 2018-04-02 12:31:00 0.0193  0.00767     0 0.971     0     0       0
#&amp;gt;  9 2018-04-02 12:32:00 0.0300  0.0274      0 0.943     0     0       0
#&amp;gt; 10 2018-04-02 12:33:00 0.0162  0.00617     0 0.978     0     0       0
#&amp;gt; # ... with 43,189 more rows, and 1 more variable: stolen &amp;lt;dbl&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since the data is in &lt;code&gt;tibble&lt;/code&gt; format, you can easily extract specific data, e.g., the last values of the system usage:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;tail(cpu$AVERAGE60$sys)
#&amp;gt; [1] 0.0014390667 0.0020080000 0.0005689333 0.0000000000 0.0014390667
#&amp;gt; [6] 0.0005689333
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;reading-only-a-single-archive&#34;&gt;Reading only a single archive&lt;/h4&gt;

&lt;p&gt;The underlying code in the &lt;code&gt;rrd&lt;/code&gt; package is written in C, and is therefore blazingly fast. Reading an entire RRD file takes a fraction of a second, but sometimes you may want to extract only a specific RRA archive directly.&lt;/p&gt;

&lt;p&gt;To read a single RRA archive from an RRD file, use &lt;code&gt;read_rra()&lt;/code&gt;. To use this function, you must specify several arguments that define the specific data to retrieve. This includes the consolidation function (e.g., &lt;code&gt;&amp;quot;AVERAGE&amp;quot;&lt;/code&gt;) and time step (e.g., &lt;code&gt;60&lt;/code&gt;). You must also specify either the &lt;code&gt;start&lt;/code&gt; time or the number of steps, &lt;code&gt;n_steps&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this example, I extract the average for one-minute periods (&lt;code&gt;step = 60&lt;/code&gt;) for one day (&lt;code&gt;n_steps = 24 * 60&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;end_time &amp;lt;- as.POSIXct(&amp;quot;2018-05-02&amp;quot;) # timestamp with data in example
avg_60 &amp;lt;- read_rra(rrd_cpu_0, cf = &amp;quot;AVERAGE&amp;quot;, step = 60, n_steps = 24 * 60,
                     end = end_time)

avg_60
#&amp;gt; # A tibble: 1,440 x 9
#&amp;gt;    timestamp              user      sys  nice  idle     wait   irq softirq
#&amp;gt;  * &amp;lt;dttm&amp;gt;                &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
#&amp;gt;  1 2018-05-01 00:01:00 0.00458 0.00201      0 0.992 0            0       0
#&amp;gt;  2 2018-05-01 00:02:00 0.00258 0.000570     0 0.996 0            0       0
#&amp;gt;  3 2018-05-01 00:03:00 0.00633 0.00144      0 0.992 0            0       0
#&amp;gt;  4 2018-05-01 00:04:00 0.00515 0.00201      0 0.991 0            0       0
#&amp;gt;  5 2018-05-01 00:05:00 0.00402 0.000569     0 0.995 0            0       0
#&amp;gt;  6 2018-05-01 00:06:00 0.00689 0.00144      0 0.992 0            0       0
#&amp;gt;  7 2018-05-01 00:07:00 0.00371 0.00201      0 0.993 0.00144      0       0
#&amp;gt;  8 2018-05-01 00:08:00 0.00488 0.00201      0 0.993 0.000569     0       0
#&amp;gt;  9 2018-05-01 00:09:00 0.00748 0.000568     0 0.992 0            0       0
#&amp;gt; 10 2018-05-01 00:10:00 0.00516 0            0 0.995 0            0       0
#&amp;gt; # ... with 1,430 more rows, and 1 more variable: stolen &amp;lt;dbl&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;plotting-the-results&#34;&gt;Plotting the results&lt;/h4&gt;

&lt;p&gt;The original &lt;code&gt;RRDtool&lt;/code&gt; library for Linux contains some functions to &lt;a href=&#34;https://oss.oetiker.ch/rrdtool/gallery/index.en.html&#34;&gt;easily plot&lt;/a&gt; the RRD data, a feature that distinguishes RRD from many other databases.&lt;/p&gt;

&lt;p&gt;However, R already has very rich plotting capability, so the &lt;code&gt;rrd&lt;/code&gt; R package doesn&amp;rsquo;t expose any specific plotting functions.&lt;/p&gt;

&lt;p&gt;For example, you can easily plot these data using your favourite packages, like &lt;code&gt;ggplot2&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(ggplot2)
ggplot(avg_60, aes(x = timestamp, y = user)) + 
  geom_line() +
  stat_smooth(method = &amp;quot;loess&amp;quot;, span = 0.125, se = FALSE) +
  ggtitle(&amp;quot;CPU0 usage, data read from RRD file&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-06-21-analysing-rrd-files_files/ggplot.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;getting-the-rrd-files-from-rstudio-server-pro-and-rstudio-connect&#34;&gt;Getting the RRD files from RStudio Server Pro and RStudio Connect&lt;/h2&gt;

&lt;p&gt;As I mentioned in the introduction, both &lt;a href=&#34;https://www.rstudio.com/products/rstudio-server-pro/&#34;&gt;RStudio Server Pro&lt;/a&gt; and &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt; use RRD to store metrics. In fact, these metrics are used to power the administration dashboard of these products.&lt;/p&gt;

&lt;p&gt;This means that often the easiest solution is simply to enable the admin dashboard and view the information there.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-06-21-analysing-rrd-files_files/rsp_admin_dashboard.png&#34; alt=&#34;RStudio Server Pro admin dashboard&#34; /&gt;&lt;/p&gt;

&lt;p&gt;However, sometimes R users and system administrators have a need to analyze the metrics in more detail, so in this section, I discuss where you can find the files for analysis.&lt;/p&gt;

&lt;p&gt;The administration guides for these products explain where to find the metrics files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The admin guide for &lt;strong&gt;RStudio Server Pro&lt;/strong&gt; discusses metrics in section &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/auditing-and-monitoring.html#monitoring-configuration&#34;&gt;8.2 Monitoring Configuration&lt;/a&gt;.

&lt;ul&gt;
&lt;li&gt;By default, the metrics are stored at &lt;code&gt;/var/lib/rstudio-server/monitor/rrd&lt;/code&gt;, although this path is configurable by the server administrator&lt;/li&gt;
&lt;li&gt;RStudio Server Pro stores system metrics as well as user metrics&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The admin guide for &lt;strong&gt;RStudio Connect&lt;/strong&gt; discusses metrics in section &lt;a href=&#34;http://docs.rstudio.com/connect/admin/historical-information.html#metrics&#34;&gt;16.1 Historical Metrics&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;The default path for metrics logs is &lt;code&gt;/var/lib/rstudio-connect/metrics&lt;/code&gt;, though again, this is configurable by the server administrator.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;rsc &amp;lt;- &amp;quot;/var/lib/rstudio-connect/metrics/rrd&amp;quot;
rsp &amp;lt;- &amp;quot;/var/lib/rstudio-server/monitor/rrd&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you want to analyze these files, it is best to copy the files to a different location.  The security and permissions on both products are configured in such a way that it&amp;rsquo;s not possible to read the files while they are in the original folder. Therefore, we recommend that you copy the files to a different location and do the analysis there.&lt;/p&gt;
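&lt;p&gt;A minimal sketch of such a copy from R (the source path is the RStudio Server Pro default mentioned above; reading it will typically require elevated privileges):&lt;/p&gt;

```r
# Copy the RRD files to a scratch location where they can be analysed
# (the source path is the default discussed above and may not exist on
#  your machine; adjust both paths and permissions as needed)
src <- "/var/lib/rstudio-server/monitor/rrd"
dst <- file.path(tempdir(), "rrd-copy")
dir.create(dst, showWarnings = FALSE)
file.copy(list.files(src, full.names = TRUE), dst, overwrite = TRUE)
```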

&lt;h3 id=&#34;warning-about-using-the-rstudio-connect-rrd-files&#34;&gt;Warning about using the RStudio Connect RRD files&lt;/h3&gt;

&lt;p&gt;The RStudio Connect team is actively planning to change the way content-level metrics are stored, so data related to shiny apps, markdown reports, etc. will likely look different in a future release.&lt;/p&gt;

&lt;p&gt;To be clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The schemas might change&lt;/li&gt;
&lt;li&gt;RStudio Connect may stop tracking some metrics&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s also possible that the entire mechanism might change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only guarantees that we make in RStudio Connect are around the data that we actually surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;server-wide user counts&lt;/li&gt;
&lt;li&gt;RAM&lt;/li&gt;
&lt;li&gt;CPU data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that if you analyze RRD files, you should be aware that &lt;strong&gt;the entire mechanism for storing metrics might change in future&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&#34;additional-caveat&#34;&gt;Additional caveat&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The metrics collection process runs in a sandboxed environment, and it is not possible to publish a report to RStudio Connect that reads the metrics directly. If you want to automate a process to read the Connect metrics, you will have to set up a &lt;a href=&#34;https://en.wikipedia.org/wiki/Cron&#34;&gt;cron&lt;/a&gt; job to copy the files to a different location, and run the analysis against the copied files. (Also, re-read the warning that everything might change!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;example&#34;&gt;Example&lt;/h3&gt;

&lt;p&gt;In the following worked example, I copied some RRD files that originated in RStudio Connect to a different location on disk, and stored that location in a &lt;a href=&#34;https://github.com/rstudio/config&#34;&gt;&lt;code&gt;config&lt;/code&gt;&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;First, list the file names:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;config &amp;lt;- config::get()
rrd_location &amp;lt;- config$rrd_location
rrd_location %&amp;gt;% 
  list.files() %&amp;gt;% 
  tail(20)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;content-978.rrd&amp;quot;      &amp;quot;content-986.rrd&amp;quot;      &amp;quot;content-98.rrd&amp;quot;      
##  [4] &amp;quot;content-990.rrd&amp;quot;      &amp;quot;content-995.rrd&amp;quot;      &amp;quot;content-998.rrd&amp;quot;     
##  [7] &amp;quot;cpu-0.rrd&amp;quot;            &amp;quot;cpu-1.rrd&amp;quot;            &amp;quot;cpu-2.rrd&amp;quot;           
## [10] &amp;quot;cpu-3.rrd&amp;quot;            &amp;quot;license-users.rrd&amp;quot;    &amp;quot;network-eth0.rrd&amp;quot;    
## [13] &amp;quot;network-lo.rrd&amp;quot;       &amp;quot;system-CPU.rrd&amp;quot;       &amp;quot;system.cpu.usage.rrd&amp;quot;
## [16] &amp;quot;system.load.rrd&amp;quot;      &amp;quot;system.memory.rrd&amp;quot;    &amp;quot;system-RAM.rrd&amp;quot;      
## [19] &amp;quot;system.swap.rrd&amp;quot;      &amp;quot;system-SWAP.rrd&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The file names indicate that RStudio Connect collects metrics for the system (CPU, RAM, etc.), as well as for every piece of published content.&lt;/p&gt;

&lt;p&gt;To look at the system load, first describe the contents of the &lt;code&gt;&amp;quot;system.load.rrd&amp;quot;&lt;/code&gt; file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;sys_load &amp;lt;- file.path(rrd_location, &amp;quot;system.load.rrd&amp;quot;)
describe_rrd(sys_load)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;## A RRD file with 10 RRA arrays and step size 60
## [1] AVERAGE_60 (43200 rows)
## [2] AVERAGE_300 (25920 rows)
## [3] MIN_300 (25920 rows)
## [4] MAX_300 (25920 rows)
## [5] AVERAGE_3600 (8760 rows)
## [6] MIN_3600 (8760 rows)
## [7] MAX_3600 (8760 rows)
## [8] AVERAGE_86400 (1825 rows)
## [9] MIN_86400 (1825 rows)
## [10] MAX_86400 (1825 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This output tells you that metrics are collected every 60 seconds (one minute), and then consolidated at selected intervals (1 minute, 5 minutes, 1 hour and 1 day). You can also tell that the consolidation functions are &lt;code&gt;average&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To extract one month of data, averaged at 5-minute intervals use &lt;code&gt;step = 300&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;dat &amp;lt;- read_rra(sys_load, cf = &amp;quot;AVERAGE&amp;quot;, step = 300L, n_steps = (3600 / 300) * 24 * 30)
dat
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;## # A tibble: 8,640 x 4
##    timestamp            `1min` `5min` `15min`
##  * &amp;lt;dttm&amp;gt;                &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
##  1 2018-05-10 19:10:00 0.0254  0.0214  0.05  
##  2 2018-05-10 19:15:00 0.263   0.153   0.0920
##  3 2018-05-10 19:20:00 0.0510  0.117   0.101 
##  4 2018-05-10 19:25:00 0.00137 0.0509  0.0781
##  5 2018-05-10 19:30:00 0       0.0168  0.0534
##  6 2018-05-10 19:35:00 0       0.01    0.05  
##  7 2018-05-10 19:40:00 0.0146  0.0166  0.05  
##  8 2018-05-10 19:45:00 0.00147 0.0115  0.05  
##  9 2018-05-10 19:50:00 0.0381  0.0306  0.05  
## 10 2018-05-10 19:55:00 0.0105  0.018   0.05  
## # ... with 8,630 more rows
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It is very easy to plot this using your preferred plotting package, e.g., &lt;code&gt;ggplot2&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;ggplot(dat, aes(x = timestamp, y = `5min`)) + 
  geom_line() + 
  stat_smooth(method = &amp;quot;loess&amp;quot;, span = 0.125)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-06-21-analysing-rrd-files_files/ggplot.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;rrd&lt;/code&gt; package, available from &lt;a href=&#34;https://github.com/andrie/rrd&#34;&gt;GitHub&lt;/a&gt;, makes it very easy to read metrics stored in the RRD database format. Reading an archive is very quick, and your resulting data is a &lt;code&gt;tibble&lt;/code&gt; for an individual archive, or a list of &lt;code&gt;tibble&lt;/code&gt;s for the entire file.&lt;/p&gt;

&lt;p&gt;This makes it easy to analyze your data using the &lt;code&gt;tidyverse&lt;/code&gt; packages, and to plot the information.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/06/20/reading-rrd-files/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Enterprise Dashboards with R Markdown</title>
      <link>https://rviews.rstudio.com/2018/05/16/replacing-excel-reports-with-r-markdown-and-shiny/</link>
      <pubDate>Wed, 16 May 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/05/16/replacing-excel-reports-with-r-markdown-and-shiny/</guid>
      <description>
        


&lt;p&gt;&lt;em&gt;This is a second post in a series on enterprise dashboards. See our previous post, &lt;a href=&#34;https://rviews.rstudio.com/2017/09/20/dashboards-with-r-and-databases/&#34;&gt;Enterprise-ready dashboards with Shiny Databases&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like &lt;a href=&#34;https://products.office.com/en-us/excel&#34;&gt;Microsoft Excel&lt;/a&gt; make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. In addition to fostering low productivity, organizations risk profits and reputations in an age where insightful analyses and process control translate to a competitive advantage. Most organizations want better control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-05-16-replacing-excel-with-r-markdown-and-shiny/tracker-excel.png&#34; width=&#34;400&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;This Excel dashboard attempts to function as a real application by allowing its users to filter and visualize key metrics about customers. It took dozens of hours to build. The intent was to hand off maintenance to someone else, but the dashboard was so complex that the author was forced to maintain it. Every week, the author copied data from an ETL tool and pasted it into the workbook, spot checked a few cells, and then emailed the entire workbook to a distribution list. Everyone on the distribution list got a new copy in their inbox every week. There were no security controls around data management or data access. Anyone with the report could modify its contents. The update process often broke the brittle cell dependencies; or worse, discrepancies between weeks passed unnoticed. It was almost impossible to guarantee the integrity of each weekly report.&lt;/p&gt;
&lt;div id=&#34;why-coding-is-important&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Why coding is important&lt;/h3&gt;
&lt;p&gt;Excel workbooks are hard to maintain, collaborate on, and debug because they are not reproducible. The content of every cell and the design of every chart is set without ever recording the author’s actions. There is no simple way to recreate an Excel workbook because there is no recipe (i.e., set of instructions) that describes how it was made. Because Excel workbooks lack a recipe, they tend to be hard to maintain and prone to errors. It takes care, vigilance, and subject-matter knowledge to maintain a complex Excel workbook. Even then, human errors abound and changes require a lot of effort.&lt;/p&gt;
&lt;p&gt;A better approach is to write code. There are many &lt;a href=&#34;https://twitter.com/MaartenvSmeden/status/995791001825431552&#34;&gt;reasons to start programming&lt;/a&gt;. When you create a recipe with code, anyone can reproduce your work (including your future self). The act of coding implicitly invites others to collaborate with you. You can systematically validate and debug your code. All of these things lead to better code over time. Coding in R has particular advantages given its vast ecosystem of packages, its vibrant community, and its powerful tool chain.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;using-r-markdown&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Using R Markdown&lt;/h3&gt;
&lt;p&gt;There are many tools for replacing complex Excel dashboards with R code. One of these tools is &lt;a href=&#34;https://rmarkdown.rstudio.com/&#34;&gt;R Markdown&lt;/a&gt;, an open-source R package that turns your analyses into high quality documents, reports, presentations and dashboards. R Markdown documents are fully reproducible and support dozens of output formats including HTML, PDF, and Microsoft Word documents.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-05-16-replacing-excel-with-r-markdown-and-shiny/tracker-rmd.png&#34; width=&#34;400&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href=&#34;http://colorado.rstudio.com/rsc/tracker-report/tracker-report.html&#34;&gt;Here&lt;/a&gt; is the same Excel dashboard translated to an R Markdown report. Because this report is written in code, it is vastly simpler and easier to maintain. Like the Excel dashboard above, this R Markdown report is designed to take user inputs so that it can render custom report versions.&lt;/p&gt;
&lt;p&gt;Many people are already aware that R Markdown reports combine narrative, code, and output in a single document. What is less commonly known is that you can generalize any R Markdown report by declaring parameters in the document header. R Markdown documents with parameters are known as &lt;a href=&#34;https://rmarkdown.rstudio.com/developer_parameterized_reports.html&#34;&gt;parameterized reports&lt;/a&gt;. In the Excel dashboard users can select &lt;code&gt;segment&lt;/code&gt;, &lt;code&gt;group&lt;/code&gt;, and &lt;code&gt;period&lt;/code&gt;. In a parameterized R Markdown document, you would specify these inputs with the following YAML header:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
title: Customer Tracker Report
output: html_notebook
params:
  seg: 
    label: &amp;quot;Segment:&amp;quot;
    value: Total
    input: select
    choices: [Total, Heavy, Mainstream, Focus1, Focus2, 
              Specialty, Diverse1, Diverse2, Other, New]
  grp: 
    label: &amp;quot;Group:&amp;quot;
    value: Total
    input: select
    choices: [Total, Core, Extra]
  per: 
    label: &amp;quot;Period:&amp;quot;
    value: Week
    input: radio
    choices: [Week, YTD]
---&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then access the parameters declared in the YAML header from your R code chunks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```{r}
params$seg
params$grp
params$per
```&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can render the document with different inputs by selecting &lt;a href=&#34;https://rmarkdown.rstudio.com/developer_parameterized_reports.html#parameter_user_interfaces&#34;&gt;knit with parameters&lt;/a&gt; in RStudio. This option will open a user interface that allows you to select the parameters you want.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://media.giphy.com/media/vwicMYfRPL6YuRQGfo/giphy.gif&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;If you want to automate the process of creating custom report versions, you can render these documents programmatically with the &lt;code&gt;rmarkdown::render()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rmarkdown::render(
  input = &amp;quot;tracker-report.Rmd&amp;quot;, 
  params = list(seg = &amp;quot;Focus1&amp;quot;, grp = &amp;quot;Core&amp;quot;, per = &amp;quot;Weekly&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
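&lt;p&gt;Going one step further, you can generate a whole batch of custom report versions by looping over parameter values. A minimal sketch, assuming &lt;code&gt;tracker-report.Rmd&lt;/code&gt; and its data are available in the working directory:&lt;/p&gt;

```r
library(rmarkdown)

# Render one report version per customer segment; the segment names come
# from the choices declared in the YAML header above
segments <- c("Total", "Heavy", "Mainstream", "Focus1", "Focus2")

for (seg in segments) {
  render(
    input       = "tracker-report.Rmd",
    params      = list(seg = seg, grp = "Total", per = "Week"),
    output_file = paste0("tracker-", seg, ".html")
  )
}
```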
&lt;/div&gt;
&lt;div id=&#34;publishing-to-rstudio-connect&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Publishing to RStudio Connect&lt;/h3&gt;
&lt;p&gt;Managing access and permissions for an ocean of Excel files is painful. Data in Excel spreads through an organization without controls, like a virus through a body with no immune defense. There are better ways to secure the operation, access, and distribution of information.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://media.giphy.com/media/9M52kMrLHrDfxI3nrq/giphy.gif&#34; /&gt;

&lt;/div&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-05-16-replacing-excel-with-r-markdown-and-shiny/pb-publishing.png&#34; width=&#34;50&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;RStudio Connect is a server product from RStudio that is designed for secure sharing of R content. It is on-premises software you run behind your firewall. You keep control of your data and of who has access. With RStudio Connect, you can see all your content, decide who should be able to view and collaborate on it, tune performance, schedule updates, and view logs. You can schedule your R Markdown reports to run automatically or even distribute the latest version by email.&lt;/p&gt;
&lt;p&gt;When you publish a parameterized R Markdown report to RStudio Connect, an interface appears for selecting inputs. Viewers can create new report versions, then email themselves a copy. Collaborators can save and schedule new report versions, then email others a copy. You can even attach &lt;a href=&#34;http://docs.rstudio.com/connect/1.6.2/user/r-markdown.html#r-markdown-output-files&#34;&gt;output files&lt;/a&gt; to these versions. Using parameterized R Markdown documents in RStudio Connect is a powerful way to communicate information.&lt;/p&gt;
&lt;p&gt;You can publish content from the RStudio IDE by clicking the &lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/228270928-Push-button-publishing-to-RStudio-Connect&#34;&gt;Publish button&lt;/a&gt; that looks like a blue Eye of Horus. Pressing this button will begin the publishing process. First, it creates a set of instructions for recreating your content. Second, it deploys your content bundle to the server. Third, it recreates your content on RStudio Connect. Push-button publishing has a long history of being used with RStudio. In 2012, RStudio enabled push-button publishing of R Markdown documents to &lt;a href=&#34;https://rpubs.com/&#34;&gt;RPubs&lt;/a&gt;. In 2014, RStudio enabled push-button publishing of Shiny apps to &lt;a href=&#34;http://www.shinyapps.io/&#34;&gt;shinyapps.io&lt;/a&gt;. In 2016, RStudio enabled push-button publishing to &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-shiny&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Adding Shiny&lt;/h3&gt;
&lt;p&gt;R Markdown documents are rendered with batch processing. That makes them ideal for automation, long running workflows, and custom report versions. However, if you want your documents to be immediately reactive to user input, then you can add a Shiny runtime. These &lt;a href=&#34;https://rmarkdown.rstudio.com/authoring_shiny.html&#34;&gt;interactive documents&lt;/a&gt; behave like a Shiny application in that they must be hosted. You can host &lt;a href=&#34;https://rmarkdown.rstudio.com/authoring_shiny.html&#34;&gt;interactive documents&lt;/a&gt; and &lt;a href=&#34;http://shiny.rstudio.com/&#34;&gt;Shiny applications&lt;/a&gt; with RStudio Connect. Deciding when to choose between R Markdown, interactive documents, and Shiny applications is a subject for a later post.&lt;/p&gt;
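&lt;p&gt;Adding the Shiny runtime is a one-line change to the document&amp;rsquo;s YAML header. For example:&lt;/p&gt;

```yaml
---
title: Customer Tracker
output: html_document
runtime: shiny
---
```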
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Reproducible code in R leads to better analysis and collaboration. You can use parameterized R Markdown reports to create complex, interactive dashboards. Hosting these dashboards securely in RStudio Connect gives you control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Click &lt;a href=&#34;https://github.com/sol-eng/customer-tracker&#34;&gt;here&lt;/a&gt; for source code.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/05/16/replacing-excel-reports-with-r-markdown-and-shiny/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Multiple Versions of R</title>
      <link>https://rviews.rstudio.com/2018/03/21/multiple-versions-of-r/</link>
      <pubDate>Wed, 21 Mar 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/03/21/multiple-versions-of-r/</guid>
      <description>
        


&lt;p&gt;&lt;img src=&#34;/post/2018-03-21-multiple-versions-r/pyramids.jpg&#34; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data scientists prefer using the latest R packages to analyze their data. To ensure a good user experience, you will need a recent version of R running on a modern operating system. If you run R on a production server – and especially if you use RStudio Connect – plan to support multiple versions of R side by side so that your code, reports, and apps remain stable over time. You can support multiple versions of R concurrently by building R from source. Plan to install a new version of R at least once per year on your servers.&lt;/em&gt;&lt;/p&gt;

&lt;div id=&#34;a-solid-foundation-for-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;A solid foundation for R&lt;/h3&gt;
&lt;p&gt;Administering R on the desktop is relatively easy, because desktops are designed for a single user at a specific time. Desktop users upgrade R versions and R packages as new software becomes available, leaving old versions and packages behind. Servers, on the other hand, are designed to support multiple people who want to access content across time. Servers are increasingly used for building &lt;a href=&#34;https://rviews.rstudio.com/2017/12/20/rstudio-server-quick-start/&#34;&gt;data science labs in R&lt;/a&gt;, deploying R in production, and running R in the cloud. You may find that the same strategies you use to administer R on your desktop do not work as well on a server. In particular, upgrading your version of R must be handled differently.&lt;/p&gt;
&lt;p&gt;If you upgrade R on your server as you do on your desktop, you could easily break some apps and disrupt your teams. Administrators should exercise caution when &lt;a href=&#34;https://shiny.rstudio.com/articles/upgrade-R.html&#34;&gt;upgrading to a new version of R&lt;/a&gt; on a Linux server. Consider the following situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are hosting apps on RStudio Connect and Shiny Server for more than a year. When you upgrade R, you break many of your older apps.&lt;/li&gt;
&lt;li&gt;Your team is developing code on a shared instance of RStudio Server. When you upgrade R, you disrupt people’s work and break their code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Instead of upgrading your existing version of R, a better solution to these problems is to run multiple versions of R side by side. This strategy preserves past versions of R so you can &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#managing-upgrades-of-r&#34;&gt;manage upgrades&lt;/a&gt; and keep your code, apps, and reports stable over time.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-r-from-source&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Building R from source&lt;/h3&gt;
&lt;p&gt;The best way to run multiple versions of R side by side is to build R from source. If you are running R on a Linux server – and particularly in the enterprise – you should always build R from source, because it will help you:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Run multiple versions of R side by side&lt;/li&gt;
&lt;li&gt;Guarantee that R will work on your unique server configuration&lt;/li&gt;
&lt;li&gt;Potentially speed up certain low-level computations used by R&lt;/li&gt;
&lt;li&gt;Build technical expertise that will help you administer R at scale&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Most enterprise IT departments will be comfortable building software from source. If you have never built R from source, it is very straightforward. First, you need the build dependencies for R. If you’ve already installed R from a binary source like CRAN or EPEL, you may already have these dependencies installed; otherwise, you can run &lt;code&gt;sudo yum-builddep R&lt;/code&gt; on RedHat or &lt;code&gt;sudo apt-get build-dep r-base&lt;/code&gt; on Ubuntu. Second, you should obtain and unpack the &lt;a href=&#34;https://cran.rstudio.com/src/base/&#34;&gt;source tarball&lt;/a&gt; for the version of R you want to install from CRAN. Third, from within the extracted source directory, build R from source using &lt;code&gt;configure&lt;/code&gt;, &lt;code&gt;make&lt;/code&gt;, and &lt;code&gt;make install&lt;/code&gt; commands. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BUILD R FROM SOURCE ON REDHAT LINUX
# R-3.4.3

# Install Linux dependencies
$ sudo yum-builddep R

# Download and extract source code
$ wget https://cran.r-project.org/src/base/R-3/R-3.4.3.tar.gz
$ tar -xzvf R-3.4.3.tar.gz
$ cd R-3.4.3

# Build R from source
$ ./configure --prefix=/opt/R/$(cat VERSION) --enable-R-shlib --with-blas --with-lapack
$ make
$ sudo make install&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script installs R version 3.4.3 into &lt;code&gt;/opt/R/3.4.3&lt;/code&gt;, but you can install R into any of the &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#recommended-installation-directories&#34;&gt;recommended directories&lt;/a&gt;. The &lt;code&gt;--enable-R-shlib&lt;/code&gt; option is required to make the shared libraries known to RStudio. The &lt;code&gt;--with-blas&lt;/code&gt; and &lt;code&gt;--with-lapack&lt;/code&gt; options are not required, but are commonly included. These options link R against the system &lt;a href=&#34;http://www.netlib.org/blas/#_presentation&#34;&gt;BLAS&lt;/a&gt; and &lt;a href=&#34;http://www.netlib.org/lapack/#_presentation&#34;&gt;LAPACK&lt;/a&gt; libraries, which are used to speed up certain low-level math computations (e.g., multiplying and inverting matrices). These libraries will not speed up R itself, but can significantly speed up the linear-algebra routines that much R code ultimately calls.&lt;/p&gt;
&lt;p&gt;If you run into problems installing R from source, you can always remove the installation directory and start over. However, once the installation succeeds, you should never move the installation directory – in other words, always install into the final destination directory. If you run into problems with dependencies, make sure you are able to identify and install all of the required Linux libraries (e.g., the X11 library is commonly overlooked). Building R from source will be much easier with a modern operating system that is connected to the Internet.&lt;/p&gt;
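&lt;p&gt;Once installed, each version lives under its own prefix and can be invoked explicitly, which is what allows versions to coexist. A sketch, assuming the &lt;code&gt;/opt/R&lt;/code&gt; convention used above:&lt;/p&gt;

```shell
# Each version is self-contained under its own prefix
ls /opt/R

# Invoke a specific version explicitly
/opt/R/3.4.3/bin/R --version

# Optionally expose a versioned alias on the PATH
sudo ln -s /opt/R/3.4.3/bin/R /usr/local/bin/R-3.4.3
```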
&lt;p&gt;For further details about building R from source, see the &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#building-additional-versions-from-source&#34;&gt;RStudio Server Admin Guide&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rstudio-professional-products&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;RStudio professional products&lt;/h3&gt;
&lt;p&gt;RStudio professional products automatically support multiple versions of R and provide &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#overview-3&#34;&gt;additional features&lt;/a&gt;, such as having administrators control access to multiple versions, or allowing users to choose for themselves. RStudio Connect automatically provides &lt;a href=&#34;http://docs.rstudio.com/connect/admin/r.html#r-version-matching&#34;&gt;R version matching&lt;/a&gt;. Running multiple versions of R side by side with RStudio Connect will ensure that your content persists over time.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/215488098-Installing-multiple-versions-of-R-on-Linux&#34;&gt;Installing multiple versions of R on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://shiny.rstudio.com/articles/upgrade-R.html&#34;&gt;Upgrading to a new version of R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/212364537-Multiple-Versions-of-R-in-RStudio-Server-Pro&#34;&gt;Multiple versions of R with RStudio Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://docs.rstudio.com/shiny-server/#r_path&#34;&gt;Multiple versions of R with Shiny Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://docs.rstudio.com/connect/admin/r.html#upgrading-r&#34;&gt;Multiple versions of R with RStudio Connect&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/03/21/multiple-versions-of-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Deep learning at rstudio::conf 2018</title>
      <link>https://rviews.rstudio.com/2018/02/14/deep-learning-rstudio-conf-2018/</link>
      <pubDate>Wed, 14 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/02/14/deep-learning-rstudio-conf-2018/</guid>
      <description>
        

&lt;p&gt;Two weeks ago, &lt;a href=&#34;https://www.rstudio.com/conference/&#34;&gt;rstudio::conf 2018&lt;/a&gt; was held in San Diego. We had 1,100 people attend the sold-out event.  In this post, I summarize my experience of the talks on the topic of deep learning with R, including the keynote by &lt;a href=&#34;https://www.linkedin.com/profile/view?id=10843566/&#34;&gt;J.J. Allaire&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-02-13_de_Vries_deep_learning_at_rstudio_conf_files/J.J._video.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h1 id=&#34;keynote&#34;&gt;Keynote&lt;/h1&gt;

&lt;p&gt;The keynote on the second day was J.J. Allaire discussing &amp;ldquo;Machine Learning with Tensorflow and R&amp;rdquo;. In this talk, J.J. took us on a tour of how to use TensorFlow with R.  He started with the basics, e.g., what is a tensor (it&amp;rsquo;s an array), and explained how the tensors &amp;ldquo;flow&amp;rdquo; in a computation graph in the &lt;code&gt;TensorFlow&lt;/code&gt; library. The &lt;code&gt;tensorflow&lt;/code&gt; package in R is an interface to the &lt;code&gt;TensorFlow&lt;/code&gt; library, meaning you can access the full power of TensorFlow directly from R.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-02-13_de_Vries_deep_learning_at_rstudio_conf_files/tensors_flowing.gif&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;For several years, there has been a great deal of hype about deep learning, with multiple libraries (primarily written in Python and C++). Of these libraries, TensorFlow seems to get the dominant share of interest. R has always been a language that excels in its ability to interact with other languages, including Fortran, C++, and now Python. With the release of the &lt;code&gt;tensorflow&lt;/code&gt; package, R users can make full use of &lt;em&gt;all&lt;/em&gt; of the functions in TensorFlow.&lt;/p&gt;

&lt;p&gt;Advances in deep learning, including algorithms, GPU computing, and availability of large data sets, have combined for the enormous success of deep learning in many fields. This includes near-human-level performance in the fields of image classification, speech recognition, and machine translation, to name a few.&lt;/p&gt;

&lt;p&gt;However, J.J. points out that TensorFlow is quite a low-level mathematical library, and that most practitioners would benefit from writing their neural network code using &lt;code&gt;keras&lt;/code&gt;, a package that exposes a high-level API. Keras supports multiple back ends, including TensorFlow, CNTK and Theano. You can find out more at the &lt;a href=&#34;https://keras.rstudio.com/&#34;&gt;&lt;code&gt;keras&lt;/code&gt; package page&lt;/a&gt;.&lt;/p&gt;
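&lt;p&gt;To give a flavor of that high-level API, here is a minimal &lt;code&gt;keras&lt;/code&gt; model definition in R (a sketch, assuming the &lt;code&gt;keras&lt;/code&gt; package and a TensorFlow backend are installed; the input shape of 784 assumes flattened 28x28 images, as in MNIST):&lt;/p&gt;

```r
library(keras)

# Define a small feed-forward network for 10-class classification
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(784)) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile with an optimizer, loss, and metric before training
model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
```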

&lt;p&gt;J.J. concluded his talk by demonstrating several ways to deploy a &lt;code&gt;keras&lt;/code&gt; or &lt;code&gt;tensorflow&lt;/code&gt; model, including publishing to RStudio Connect.&lt;/p&gt;

&lt;p&gt;To find out more about J.J.&amp;rsquo;s talk, you can watch the &lt;a href=&#34;https://www.youtube.com/watch?v=atiYXm7JZv0&#34;&gt;keynote video&lt;/a&gt; or view the &lt;a href=&#34;https://rstd.io/ml-with-tensorflow-and-r/&#34;&gt;slides&lt;/a&gt;. You can also download the &lt;a href=&#34;https://github.com/rstudio/cheatsheets/raw/master/keras.pdf&#34;&gt;&lt;code&gt;keras&lt;/code&gt; cheat sheet&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&#34;other-talks&#34;&gt;Other talks&lt;/h1&gt;

&lt;p&gt;Following the keynote, the conference split into several tracks. I attended Session 1, &amp;ldquo;interop&amp;rdquo;, which focused on interoperability between R and several deep-learning frameworks, including deployment options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first talk in this session was by &lt;a href=&#34;https://www.linkedin.com/in/michaelquinn32?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base%3B87KZ5Uq%2FQxSpdhxC0jwFkg%3D%3D&#34;&gt;Michael Quinn&lt;/a&gt; from Google. Michael discussed &amp;ldquo;large-scale machine learning using TensorFlow, BigQuery and Cloud ML&amp;rdquo;. Once you have a &lt;code&gt;keras&lt;/code&gt; or &lt;code&gt;tensorflow&lt;/code&gt; model, you can deploy it to &lt;a href=&#34;https://cloud.google.com/ml-engine/&#34;&gt;Google Cloud Machine Learning&lt;/a&gt; (Cloud ML). What I find interesting about this is that Cloud ML is a service designed for machine learning. Using this service, you can deploy models without having to stand up a virtual machine first.  You can do this deployment using R code with the &lt;a href=&#34;https://tensorflow.rstudio.com/tools/cloudml/articles/getting_started.html&#34;&gt;&lt;code&gt;cloudml&lt;/code&gt; package&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The next talk was by Javier Luraschi from RStudio, who spoke about &amp;ldquo;Deploying TensorFlow models with &lt;code&gt;tfdeploy&lt;/code&gt;&amp;rdquo;. The &lt;a href=&#34;https://tensorflow.rstudio.com/tools/tfdeploy/articles/introduction.html&#34;&gt;&lt;code&gt;tfdeploy&lt;/code&gt; package&lt;/a&gt; exposes a unified way to deploy models to several platforms, including &lt;a href=&#34;https://www.tensorflow.org/serving/&#34;&gt;TensorFlow Serving&lt;/a&gt;, &lt;a href=&#34;https://tensorflow.rstudio.com/tools/cloudml/&#34;&gt;Cloud ML&lt;/a&gt;, and &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;. Javier made his talk available as &lt;a href=&#34;http://rpubs.com/jluraschi/deploying-tensorflow-rstudio-conf&#34;&gt;slides&lt;/a&gt; as well as &lt;a href=&#34;https://github.com/rstudio/rstudio-conf/tree/master/2018/Deploying_TensorFlow_Models--Javier%20Luraschi&#34;&gt;code&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The final presentation was by &lt;a href=&#34;https://www.linkedin.com/in/alikzaidi/&#34;&gt;Ali Zaidi&lt;/a&gt; from Microsoft, who talked about &amp;ldquo;Reinforcement learning in Minecraft with CNTK-R&amp;rdquo;. Ali showed how he trained a deep-learning model to control an agent in &lt;a href=&#34;https://minecraft.net/en-us/&#34;&gt;Minecraft&lt;/a&gt;, the popular online game. In his experiment, he taught the agent to navigate a maze, as well as understand some natural language, e.g., &amp;ldquo;Pick up the red flowers&amp;rdquo;. He used the &lt;a href=&#34;https://github.com/Microsoft/CNTK-R&#34;&gt;&lt;code&gt;CNTK-R&lt;/code&gt; package&lt;/a&gt;, which wraps the &lt;a href=&#34;https://github.com/microsoft/cntk&#34;&gt;Microsoft Cognitive Toolkit (CNTK)&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In conclusion, I&amp;rsquo;ll quote directly from J.J. Allaire&amp;rsquo;s keynote, in which he describes the key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TensorFlow is a new general-purpose numerical-computing library with lots to offer the R community.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Deep learning has made great progress and will likely increase in importance in various fields in the coming years.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;R now has a great set of APIs and supporting tools for using TensorFlow and doing deep learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/02/14/deep-learning-rstudio-conf-2018/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Package Management for Reproducible R Code</title>
      <link>https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/</link>
      <pubDate>Thu, 18 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/</guid>
      <description>
        

&lt;p&gt;Any programming environment should be optimized for its task, and not all tasks are alike.  For example, if you are exploring uncharted mountain ranges, the portability of a tent is essential.  However, when building a house to weather hurricanes, investing in a strong foundation is important. Similarly, when beginning a new data science programming project, it is prudent to assess how much effort should be put into ensuring the code is reproducible.&lt;/p&gt;

&lt;p&gt;Note that it is certainly possible to go back later and &amp;ldquo;shore up&amp;rdquo; the reproducibility of a project where it is weak. This is often the case when an &amp;ldquo;ad-hoc&amp;rdquo; project becomes an important production analysis. However, the first step in starting a project is to make a decision regarding the trade-off between the amount of time to set up the project and the probability that the project will need to be reproducible in arbitrary environments.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-01-17-package-management-for-reproducible-r-code_files/spectrum-notext.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;challenges&#34;&gt;Challenges&lt;/h2&gt;

&lt;p&gt;It is important to understand the reasons that reproducible programming is challenging. Once programming practices and external data are taken into account, the primary difficulty is dependency management over time.  Dependency management is important because dependencies are so essential to R development.  R has a fast-moving community and many extremely valuable packages to make your work more effective and efficient.&lt;/p&gt;

&lt;p&gt;You will typically want to ensure that you are using recent versions of packages for a new project.  By extension, this will require a recent operating system and a recent version of R.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The best place to start is with a recent operating system and a recent version of R&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Typically, this equates to upgrading R to the latest version once or twice per year, and upgrading your operating system to a new major version every two to three years.&lt;/p&gt;

&lt;p&gt;Despite the upsides of a vibrant package ecosystem, R programmers are familiar with the pain that can come with the many (very useful) packages that change, break, and are deprecated over time.  Good dependency management ensures your project can be recomputed again in another time or another place.&lt;/p&gt;

&lt;h2 id=&#34;solutions&#34;&gt;Solutions&lt;/h2&gt;

&lt;p&gt;R package management is where most reproducibility decision-making needs to happen, although we will mention system dependencies shortly.  CRAN archives source code for all versions of R packages, past and present.  As a result, it is always possible to rebuild from source for package versions that you used to build an analysis (even on different operating systems).  &lt;em&gt;How&lt;/em&gt; you keep track of the dependencies that you used will establish how reproducible your analysis is.  As we indicated before, there is a spectrum along which you might fall.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2018-01-17-package-management-for-reproducible-r-code_files/spectrum-ex.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h3 id=&#34;ignoring-reproducibility&#34;&gt;Ignoring Reproducibility&lt;/h3&gt;

&lt;p&gt;There are occasionally times of rapid exploration where the simplest solution is to ignore reproducibility.&lt;/p&gt;

&lt;p&gt;Many R developers opt for a single massive system library of R packages and no record of what packages they used for an analysis.  Even then, it is recommended to use &amp;ldquo;&lt;a href=&#34;https://www.tidyverse.org/articles/2017/12/workflow-vs-script/&#34;&gt;RStudio Projects&lt;/a&gt;&amp;rdquo; (if you are using the RStudio IDE) and to keep your code under version control in git or another version-control system.  This approach is optimal for exploring because it involves almost no setup, and gets the programmer into the problem immediately.&lt;/p&gt;

&lt;p&gt;However, even with code version control, it can be very challenging to reproduce a result without documentation of the package versions that were in use when the code was checked in.  Further, if one project updates a package that another project was using, it is possible to have the two projects conflict on version dependencies, and one or both can break.&lt;/p&gt;

&lt;p&gt;When exploration begins to stabilize, it is best to establish a reproducible environment.  You can always capture dependencies at a given time with &lt;code&gt;sessionInfo()&lt;/code&gt; or &lt;code&gt;devtools::session_info()&lt;/code&gt;, but this does not make it easy to rebuild your dependency tree.&lt;/p&gt;
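&lt;p&gt;For example, a lightweight way to document your environment is to write the session information out alongside the analysis, using only base R:&lt;/p&gt;

```r
# Write a human-readable record of the R version, platform, and loaded
# package versions next to the analysis. This is useful for audits, but it
# is a log, not a lockfile you can rebuild an environment from.
writeLines(capture.output(sessionInfo()), "session-info.txt")
```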

&lt;h3 id=&#34;tracking-package-dependencies-per-project&#34;&gt;Tracking Package Dependencies per Project&lt;/h3&gt;

&lt;p&gt;Tracking dependencies per project isolates package versions at a project level and avoids using the system library.  &lt;a href=&#34;https://rstudio.github.io/packrat/&#34;&gt;&lt;code&gt;packrat&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/checkpoint/vignettes/checkpoint.html&#34;&gt;&lt;code&gt;checkpoint&lt;/code&gt;&lt;/a&gt;/&lt;a href=&#34;https://mran.microsoft.com/timemachine&#34;&gt;&lt;code&gt;MRAN&lt;/code&gt;&lt;/a&gt; both take this approach, so we will discuss each separately.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Programmers in other languages will be familiar with &lt;a href=&#34;https://rstudio.github.io/packrat/&#34;&gt;&lt;code&gt;packrat&lt;/code&gt;&lt;/a&gt;&amp;rsquo;s approach to storing the exact versions of packages that the project uses in a text file (&lt;code&gt;packrat.lock&lt;/code&gt;).  It works for CRAN, GitHub, and local packages, and provides a high level of reproducibility.  However, a fair amount of time is spent building packages from source, re-installing packages into the local project&amp;rsquo;s folder, and downloading the source code for packages.  Fortunately, &lt;code&gt;packrat&lt;/code&gt; has a &amp;ldquo;global cache&amp;rdquo; that can speed things up by symlinking package versions that have been installed elsewhere on the system.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://mran.microsoft.com/timemachine&#34;&gt;&lt;code&gt;MRAN&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/checkpoint/vignettes/checkpoint.html&#34;&gt;&lt;code&gt;checkpoint&lt;/code&gt;&lt;/a&gt; also take the library-per-project approach, but focus on CRAN packages and determine dependencies based on the &amp;ldquo;snapshot&amp;rdquo; of CRAN that Microsoft stored on a given day.  The programmer need only record the &amp;ldquo;checkpoint&amp;rdquo; date they are referencing to keep track of package versions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
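
&lt;p&gt;To make the &lt;code&gt;packrat&lt;/code&gt; approach concrete, a typical workflow looks roughly like the sketch below (it assumes the &lt;code&gt;packrat&lt;/code&gt; package is installed, and the project path is a placeholder):&lt;/p&gt;

```r
# Sketch of a typical packrat workflow (requires the packrat package).
library(packrat)
packrat::init("~/my-project")   # create a private, per-project package library
install.packages("dplyr")       # now installs into the project library
packrat::snapshot()             # record exact versions in packrat.lock
# Later, on another machine or after cloning the project:
packrat::restore()              # rebuild the library from packrat.lock
```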

&lt;p&gt;Both packages leverage up-front work to make reproducing an analysis quite straightforward later, but it is worth noting the differences between them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;packrat&lt;/code&gt; keeps tabs on the packages installed in your project folder and presumes that they form a complete, working, and self-consistent whole.  It also downloads package sources to your computer for future re-compiling, if necessary.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checkpoint&lt;/code&gt; chooses package versions based on a given day in MRAN history.  This presumes that all of the package versions you need were available on CRAN on that day, and that CRAN was in a self-consistent state that day.  If you want to update a package, you will need to choose a more recent date and re-install all of your packages from that snapshot to be sure that none of them break.&lt;/li&gt;
&lt;/ul&gt;
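
&lt;p&gt;With &lt;code&gt;checkpoint&lt;/code&gt;, the equivalent setup is a single call near the top of a script naming the MRAN snapshot date (this sketch assumes the &lt;code&gt;checkpoint&lt;/code&gt; package is installed; the date is an example):&lt;/p&gt;

```r
# Sketch of using checkpoint (requires the checkpoint package).
library(checkpoint)
# Scan the project for packages in use and install the versions that
# MRAN's snapshot of CRAN held on this date.
checkpoint("2018-01-17")
```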

&lt;h3 id=&#34;tracking-all-dependencies-per-project&#34;&gt;Tracking All Dependencies per Project&lt;/h3&gt;

&lt;p&gt;When it comes to other system libraries or dependencies, containers are one of the most popular solutions for reproducibility.  Containers behave like virtual machines but are far more lightweight, which makes them a good fit for reproducible data science.  To give containers a shot, you can &lt;a href=&#34;https://docs.docker.com/engine/installation/&#34;&gt;install docker&lt;/a&gt; and then take a look at the &lt;a href=&#34;https://www.rocker-project.org/&#34;&gt;rocker project&lt;/a&gt; (R on docker).&lt;/p&gt;

&lt;p&gt;At a high level, Docker saves a snapshot called an &amp;ldquo;image&amp;rdquo; that includes all of the software necessary to complete a task.  A running &amp;ldquo;image&amp;rdquo; is called a &amp;ldquo;container.&amp;rdquo;  These images are extensible, so that you can more easily build an image that has the dependencies you need for a given project.  For instance, to use the tidyverse, you might execute the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;docker pull rocker/tidyverse
docker run -d --name=my-r-container -p 8787:8787 rocker/tidyverse
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then get an interactive terminal with &lt;code&gt;docker exec -it my-r-container bash&lt;/code&gt;, or open RStudio in the browser by going to &lt;code&gt;localhost:8787&lt;/code&gt; and authenticating with user:pass &lt;code&gt;rstudio:rstudio&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It is important to consider the difficulty of maintaining package dependencies within the image.  If your &lt;code&gt;Dockerfile&lt;/code&gt; installs packages from CRAN or GitHub, the regeneration of your image will still be susceptible to changes in the published version of a package.  As a result, it is advisable to pair up &lt;code&gt;packrat&lt;/code&gt; with Docker for complete dependency management.&lt;/p&gt;

&lt;p&gt;A simple &lt;code&gt;Dockerfile&lt;/code&gt; like the following will copy the current project folder into the &lt;code&gt;rstudio&lt;/code&gt; user&amp;rsquo;s home (within the container) and install the necessary dependencies using &lt;code&gt;packrat&lt;/code&gt;.  It requires using &lt;code&gt;packrat&lt;/code&gt; for the project.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FROM rocker/rstudio

# install packrat
RUN R -e &#39;install.packages(&amp;quot;packrat&amp;quot;, repos=&amp;quot;http://cran.rstudio.com&amp;quot;, dependencies=TRUE, lib=&amp;quot;/usr/local/lib/R/site-library&amp;quot;);&#39;

USER rstudio

# copy lock file &amp;amp; install deps
COPY --chown=rstudio:rstudio packrat/packrat.* /home/rstudio/project/packrat/
RUN R -e &#39;packrat::restore(project=&amp;quot;/home/rstudio/project&amp;quot;);&#39;

# copy the rest of the directory
# .dockerignore can ignore some files/folders if desirable
COPY --chown=rstudio:rstudio . /home/rstudio/project

USER root
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then the following will get your image started, much like the tidyverse example above.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;docker build --tag=my-test-image .
docker run --rm -d --name=my-test-container -p 8787:8787 my-test-image
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that doing more complex work typically involves a bit of foresight, familiarity with design conventions, and the creation of a custom &lt;code&gt;Dockerfile&lt;/code&gt;.  However, this up-front work is rewarded by a full operating-system snapshot, including all system and package dependencies.  As a result, Docker provides optimal reproducibility for an analysis.&lt;/p&gt;

&lt;h2 id=&#34;how-certain-do-you-need-to-be-that-your-code-is-reproducible&#34;&gt;How certain do you need to be that your code is reproducible?&lt;/h2&gt;

&lt;p&gt;Notebooks are a necessary and increasingly popular starting point for discussions of reproducibility.  However, if the aim is to recompute results at another time or place, we cannot stop there.&lt;/p&gt;

&lt;p&gt;When it comes to the management of packages and other system dependencies, you will need to decide whether you want to spend more time setting up a reproducible environment, or if you want to start exploring immediately.  Whether you are putting up a tent for the night or building a house that future generations will enjoy, there are plenty of tools to help you on your way and assist you if you ever need to change course.&lt;/p&gt;

&lt;p&gt;In future posts, I hope to explore additional aspects of reproducibility.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>A Data Science Lab for R</title>
      <link>https://rviews.rstudio.com/2017/12/20/rstudio-server-quick-start/</link>
      <pubDate>Wed, 20 Dec 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/12/20/rstudio-server-quick-start/</guid>
      <description>
        


&lt;p&gt;In a &lt;a href=&#34;https://rviews.rstudio.com/2017/06/21/analytics-administration-for-r/&#34;&gt;previous post&lt;/a&gt; I described the role of analytic administrator as a data scientist who: onboards new tools, deploys solutions, supports existing standards, and trains other data scientists. In this post I will describe how someone in that role might set up a data science lab for R.&lt;/p&gt;
&lt;div id=&#34;architecture&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Architecture&lt;/h3&gt;
&lt;p&gt;A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-12-20-DS-lab_files/rsp-setup-small.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Key components of this setup include: authentication; load balancing; a testing environment; data connectivity; and a publishing platform. In this server-based architecture, data scientists use a web browser to access the data science lab. High performance compute and live data reside securely behind a firewall.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;instance-sizing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Instance Sizing&lt;/h3&gt;
&lt;p&gt;The size of your server instance depends on how many concurrent sessions you run and how large your sessions are. Keep in mind that R is single threaded by default and holds data in memory. Here is a list of example server sizes:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;27%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;54%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Instance Size&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Cores&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;RAM&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Minimum recommended&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;2&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;4G&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;This server will be for lightweight jobs, testing, and sandboxing.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Small&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;4&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;8G&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;This server will support one or two analysts with small data.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Large&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;16&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;256G&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;This server will support 15 analysts with a blend of large and small sessions. Alternatively, it will support dozens of analysts with small sessions.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Jumbo&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;32+&lt;/p&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;p&gt;1T+&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;p&gt;May be useful for heavier workloads.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;open-source-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Open-Source R&lt;/h3&gt;
&lt;p&gt;If you haven’t done so already, I recommend you &lt;a href=&#34;https://rviews.rstudio.com/2016/11/16/make-r-a-legitimate-part-of-your-organization/&#34;&gt;make R a legitimate part of your organization&lt;/a&gt; by officially recognizing it as an analytic standard. You should be familiar with installing and managing R and its packages.&lt;/p&gt;
&lt;p&gt;You can install R as a pre-compiled binary from a repository, or you can install R from source. Installing R from source allows you to install &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#installing-multiple-versions-of-r&#34;&gt;multiple versions of R side by side&lt;/a&gt;. If you compile R from source, I recommend you link to the &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/r-versions.html#building-additional-versions-from-source&#34;&gt;BLAS libraries&lt;/a&gt; so that you can speed up certain low-level math computations.&lt;/p&gt;
&lt;p&gt;Data science labs tend to require a modern toolkit. You should expect to upgrade R at least once a year. You should also keep your operating system up to date. New and improved R packages tend to work better when you use them with recent versions of R and updated system libraries.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rstudio-server-pro&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;RStudio Server Pro&lt;/h3&gt;
&lt;p&gt;Building a data science lab involves installing, configuring, and managing tools. In this section I will describe how to administer RStudio Server Pro, which has features for authentication, security, and admin controls.&lt;/p&gt;
&lt;div id=&#34;installation&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;1. Installation&lt;/h4&gt;
&lt;p&gt;Once you have installed R, you can install &lt;a href=&#34;https://www.rstudio.com/products/rstudio-server-pro/&#34;&gt;RStudio Server Pro&lt;/a&gt; by downloading the binaries and following the instructions. You will need root privileges to install and run the software. You will also need to create local system accounts for all of your R developers.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;configuration&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;2. Configuration&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/authenticating-users.html&#34;&gt;Authentication&lt;/a&gt;.&lt;/strong&gt; The first thing you will want to do after you install RStudio Server Pro is to configure it with your authentication system. RStudio Server Pro supports LDAP via PAM sessions. If you use single sign on or another system, you can configure RStudio Server Pro to work in proxied auth mode. You can also authenticate via Google accounts and local system accounts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/data-connectivity.html&#34;&gt;Data Connectivity&lt;/a&gt;.&lt;/strong&gt; Most data scientists use R with databases. The &lt;a href=&#34;https://www.rstudio.com/products/drivers/&#34;&gt;RStudio Pro Drivers&lt;/a&gt; are ODBC drivers that will connect R to some of the most popular databases today. These drivers are a free add-on for RStudio Server Pro. If you are using a data source that is not supported, or if you are using the open source version of RStudio Server, you can bring your own ODBC driver.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/load-balancing.html&#34;&gt;Load Balancing&lt;/a&gt;.&lt;/strong&gt; If you want to load balance your server instances, you can use the load balancer that is built into RStudio Server Pro or you can bring your own load balancer. Load balancing is designed to balance user sessions seamlessly across the cluster and provide high availability. It requires a shared home drive that is mounted to each one of the instances.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://www.rstudio.com/products/rstudio-server-pro/&#34;&gt;More Features&lt;/a&gt;.&lt;/strong&gt; RStudio Server Pro has a list of features that you can configure. You should decide which features you want to enable or disable. For more information on configuring each of these features, see the &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/&#34;&gt;admin guide&lt;/a&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;38%&#34; /&gt;
&lt;col width=&#34;61%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Authentication&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;LDAP, Active Directory, Google Accounts and system accounts&lt;/li&gt;
&lt;li&gt;Full support for Pluggable Authentication Modules, Kerberos via PAM, and custom authentication via proxied HTTP header&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Data Connectivity&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;RStudio Professional Drivers are ODBC data connectors that help you connect to some of the most popular databases.&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Load Balancing&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Load balance R sessions across two or more servers&lt;/li&gt;
&lt;li&gt;Ensure high availability using multiple masters&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Enhanced security&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Encrypt traffic using SSL and restrict client IP addresses&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Administrative dashboard&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Monitor active sessions and their CPU and memory utilization&lt;/li&gt;
&lt;li&gt;Suspend, forcibly terminate, or assume control of any active session&lt;/li&gt;
&lt;li&gt;Review historical usage and server logs&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Auditing and monitoring&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Monitor server resources (CPU, memory, etc.) on both a per-user and system-wide basis&lt;/li&gt;
&lt;li&gt;Send metrics to external systems with the Graphite/Carbon plaintext protocol&lt;/li&gt;
&lt;li&gt;Health check with configurable output (custom XML, JSON)&lt;/li&gt;
&lt;li&gt;Audit all R console activity by writing input and output to a central location&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;p&gt;Advanced R session management&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Tailor the version of R, reserve CPU, prioritize scheduling and limit resources by User and Group&lt;/li&gt;
&lt;li&gt;Provision accounts and mount home directories dynamically via the PAM Session API&lt;/li&gt;
&lt;li&gt;Automatically execute per-user profile scripts for database and cluster connectivity&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;p&gt;Project sharing&lt;/p&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Share projects &amp;amp; edit code files simultaneously with others&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;management&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;3. Management&lt;/h4&gt;
&lt;p&gt;Once RStudio Server Pro is installed and configured, you’ll need to manage it over time. RStudio Server Pro comes with a variety of tools for workspace and server management that will help keep your environment organized. For example, you can kill sessions, set session timeouts, and broadcast notifications to user sessions in real-time. You can manage product &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/license-management.html&#34;&gt;licenses&lt;/a&gt; for both online and offline environments. If your instances start and stop frequently you can opt for using a &lt;a href=&#34;http://docs.rstudio.com/ide/server-pro/license-management.html#floating-licensing&#34;&gt;floating license&lt;/a&gt; manager.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;next-steps&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;Your data science lab for R should be designed to scale. That might mean adding more people, more systems, or more tools. It also might mean creating more content. &lt;a href=&#34;http://shiny.rstudio.com/&#34;&gt;Shiny&lt;/a&gt; is an R package that makes it easy to build interactive web apps straight from R. &lt;a href=&#34;http://rmarkdown.rstudio.com/&#34;&gt;R Markdown&lt;/a&gt; is an R package that makes it easy to author reports and build dashboards. You can publish your Shiny apps or R Markdown reports with the push of a button to &lt;a href=&#34;https://www.rstudio.com/products/connect/&#34;&gt;RStudio Connect&lt;/a&gt;. RStudio Connect lets you share and manage content in one convenient place. You can also publish Shiny apps to &lt;a href=&#34;http://shinyapps.io&#34;&gt;shinyapps.io&lt;/a&gt;, which allows you to share your Shiny apps online.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/12/20/rstudio-server-quick-start/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Using Shiny with Scheduled and Streaming Data</title>
      <link>https://rviews.rstudio.com/2017/11/15/shiny-and-scheduled-data-r/</link>
      <pubDate>Wed, 15 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/11/15/shiny-and-scheduled-data-r/</guid>
      <description>
        

&lt;p&gt;&lt;em&gt;Note: This article is now several years old. If you have RStudio Connect, there are more &lt;a href=&#34;https://medium.com/@kelly.obriant/basic-builds-how-to-update-data-in-a-shiny-app-on-rstudio-connect-48593902b1e2&#34;&gt;modern ways of updating data in a Shiny app&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Shiny applications are often backed by fluid, changing data. Data updates can occur at different time scales: from scheduled daily updates to live streaming data and ad-hoc user inputs. This article describes best practices for handling data updates in Shiny, and discusses deployment strategies for automating data updates.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2017-11-15-shiny-and-scheduled-data/rviews_scheduled_shiny.002.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;This post builds off of a 2017 rstudio::conf talk. The recording of the &lt;a href=&#34;https://www.rstudio.com/resources/videos/dashboards-made-easy/&#34;&gt;original talk&lt;/a&gt; and the &lt;a href=&#34;https://github.com/slopp/scheduledsnow&#34;&gt;sample code&lt;/a&gt; for this post are available.&lt;/p&gt;

&lt;p&gt;The end goal of this example is a dashboard to help skiers in Colorado select a resort to visit. Recommendations are based on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Snow reports that provide useful metrics like number of runs open and amount of new snow. Snow reports are updated &lt;strong&gt;daily&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Weather data, updated in &lt;strong&gt;near real-time from a live stream&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;User preferences, entered in the dashboard.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The backend for the dashboard looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2017-11-15-shiny-and-scheduled-data/rviews_scheduled_shiny.003.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;automate-scheduled-data-updates&#34;&gt;Automate Scheduled Data Updates&lt;/h2&gt;

&lt;p&gt;The first challenge is preparing the daily data. In this case, the data preparation requires a series of API requests and then basic data cleansing. The code for this process is written &lt;strong&gt;into an R Markdown document&lt;/strong&gt;, alongside process documentation and a few simple graphs that help validate the new data. The R Markdown document ends by saving the cleansed data into a shared data directory. The entire R Markdown document is scheduled for execution.&lt;/p&gt;
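
&lt;p&gt;The final chunk of such a document might look like the following sketch; the data frame and output path here are placeholders, not the real snow-report pipeline:&lt;/p&gt;

```r
# Sketch of the last chunk of the scheduled R Markdown document:
# save the cleansed data where the Shiny app can read it.
snow_report = data.frame(resort    = c("Vail", "Keystone"),
                         runs_open = c(112, 74))
out_path = file.path(tempdir(), "snow_report.csv")  # stand-in for shared storage
write.csv(snow_report, out_path, row.names = FALSE)
```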

&lt;p&gt;It may seem odd at first to use an R Markdown document as the scheduled task. However, our team has found it incredibly useful to be able to look back through historical renderings of the &amp;ldquo;report&amp;rdquo; to gut-check the process. Using R Markdown also forces us to properly document the scheduled process.&lt;/p&gt;

&lt;p&gt;We use RStudio Connect to easily schedule the document, view past historical renderings, and ultimately to host the application. If the job fails, Connect also sends us an email containing &lt;code&gt;stdout&lt;/code&gt; from the render, which helps us stay on top of errors. (Connect can optionally send the successfully rendered report, as well.) However, the same scheduling could be accomplished with a workflow tool or even CRON.&lt;/p&gt;

&lt;p&gt;Make sure the data, written to shared storage, is readable by the user running the Shiny application - typically a service account like &lt;code&gt;rstudio-connect&lt;/code&gt; or &lt;code&gt;shiny&lt;/code&gt; can be set as the run-as user to ensure consistent behavior.&lt;/p&gt;

&lt;p&gt;Alternatively, instead of writing results to the file system, prepped data can be saved to a view in a database.&lt;/p&gt;

&lt;h2 id=&#34;using-scheduled-data-in-shiny&#34;&gt;Using Scheduled Data in Shiny&lt;/h2&gt;

&lt;p&gt;The dashboard needs to look for updates to the underlying shared data and automatically update when the data changes. (It wouldn&amp;rsquo;t be a very good dashboard if users had to refresh a page to see new data.) In Shiny, this behavior is accomplished with the &lt;code&gt;reactiveFileReader&lt;/code&gt; function:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;daily_data &amp;lt;- reactiveFileReader(
  intervalMillis = 100,
  filePath       = &#39;path/to/shared/data&#39;,
  readFunc       = readr::read_csv
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The function checks the shared data file&amp;rsquo;s update timestamp every &lt;code&gt;intervalMillis&lt;/code&gt; milliseconds to see if the data has changed. If the data has changed, the file is re-read using &lt;code&gt;readFunc&lt;/code&gt;. The resulting data object, &lt;code&gt;daily_data&lt;/code&gt;, is reactive and can be used in downstream &lt;code&gt;render*&lt;/code&gt; functions such as &lt;code&gt;renderPlot&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the cleansed data is stored in a database instead of written to a file in shared storage, use &lt;code&gt;reactivePoll&lt;/code&gt;. &lt;code&gt;reactivePoll&lt;/code&gt; is similar to &lt;code&gt;reactiveFileReader&lt;/code&gt;, but instead of checking the file&amp;rsquo;s update timestamp, a second function needs to be supplied that identifies when the database is updated. The function&amp;rsquo;s &lt;a href=&#34;https://shiny.rstudio.com/reference/shiny/latest/reactivePoll.html&#34;&gt;help documentation&lt;/a&gt; includes an example.&lt;/p&gt;
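
&lt;p&gt;For orientation, a &lt;code&gt;reactivePoll&lt;/code&gt; version might look like the following sketch; the connection object, table, and column names are hypothetical stand-ins for your own database:&lt;/p&gt;

```r
# Sketch of reactivePoll against a database (names are hypothetical).
# checkFunc runs every intervalMillis; when its return value changes,
# valueFunc re-reads the full dataset.
daily_data = reactivePoll(
  intervalMillis = 10000,
  session        = session,
  checkFunc      = function() {
    DBI::dbGetQuery(con, "SELECT MAX(updated_at) FROM snow_report")
  },
  valueFunc      = function() {
    DBI::dbGetQuery(con, "SELECT * FROM snow_report")
  }
)
```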

&lt;h2 id=&#34;streaming-data&#34;&gt;Streaming Data&lt;/h2&gt;

&lt;p&gt;The second challenge is updating the dashboard with live streaming weather data. One way for Shiny to ingest a stream of data is by turning the stream into &amp;ldquo;micro-batches&amp;rdquo;. The &lt;code&gt;invalidateLater&lt;/code&gt; function can be used for this purpose:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;liveish_data &amp;lt;- reactive({
  invalidateLater(100)
  httr::GET(...)
})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This causes Shiny to poll the streaming API every 100 milliseconds for new data. The results are available in the reactive data object &lt;code&gt;liveish_data&lt;/code&gt;. Picking how often to poll for data depends on a few factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the upstream API enforce rate limits?&lt;/li&gt;
&lt;li&gt;How long does a data update take? The application will be blocked while it polls data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is to pick a polling time that balances the user&amp;rsquo;s desire for &amp;ldquo;live&amp;rdquo; data with these two concerns.&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;To summarize, this architecture provides a number of benefits: no more painful, manual running of R code every day; dashboard code is isolated from data-prep code; and there is enough flexibility to meet user requirements for live and daily data while preventing unnecessary number crunching on the backend.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/11/15/shiny-and-scheduled-data-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Database Queries With R</title>
      <link>https://rviews.rstudio.com/2017/10/18/database-queries-with-r/</link>
      <pubDate>Wed, 18 Oct 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/10/18/database-queries-with-r/</guid>
      <description>
        


&lt;p&gt;There are many ways to query data with R. This post shows you three of the most common ways:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Using &lt;code&gt;DBI&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;dplyr&lt;/code&gt; syntax&lt;/li&gt;
&lt;li&gt;Using R Notebooks&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;background&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Several recent package improvements make it easier for you to use databases with R. The query examples below demonstrate some of the capabilities of these R packages.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://rstats-db.github.io/DBI//index.html&#34;&gt;DBI&lt;/a&gt;. The &lt;code&gt;DBI&lt;/code&gt; specification has gone through many &lt;a href=&#34;https://www.r-consortium.org/blog/2017/05/15/improving-dbi-a-retrospect&#34;&gt;recent improvements&lt;/a&gt;. When working with databases, you should always use packages that are &lt;code&gt;DBI&lt;/code&gt;-compliant.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://dplyr.tidyverse.org/&#34;&gt;dplyr&lt;/a&gt; &amp;amp; &lt;a href=&#34;http://dbplyr.tidyverse.org/&#34;&gt;dbplyr&lt;/a&gt;. The &lt;code&gt;dplyr&lt;/code&gt; package now has a generalized SQL backend for talking to databases, and the new &lt;code&gt;dbplyr&lt;/code&gt; package translates R code into database-specific variants. As of this writing, SQL variants are supported for the following databases: Oracle, Microsoft SQL Server, PostgreSQL, Amazon Redshift, Apache Hive, and Apache Impala. More will follow over time.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/rstats-db/odbc&#34;&gt;odbc&lt;/a&gt;. The &lt;code&gt;odbc&lt;/code&gt; R package provides a standard way for you to connect to any database as long as you have an ODBC driver installed. The &lt;code&gt;odbc&lt;/code&gt; R package is &lt;code&gt;DBI&lt;/code&gt;-compliant, and is recommended for ODBC connections.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RStudio also made recent improvements to its products so they work better with databases.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://blog.rstudio.com/2017/10/09/rstudio-v1.1-released/&#34;&gt;RStudio IDE (v1.1)&lt;/a&gt;. With the latest version of the RStudio IDE, you can connect to, explore, and view data in a variety of databases. The IDE has a wizard for setting up new connections, and a tab for exploring established connections. These new features are extensible and will work with any R package that has a &lt;a href=&#34;https://rstudio.github.io/rstudio-extensions/connections-contract.html&#34;&gt;connections contract&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/products/drivers/&#34;&gt;RStudio Professional Drivers&lt;/a&gt;. If you are using RStudio professional products, you can download RStudio Professional Drivers for no additional cost. The examples below use the Oracle ODBC driver. If you are using open-source tools, you can bring your own driver or use community packages – many open-source drivers and community packages exist for connecting to a variety of databases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using databases with R is a broad subject and there is more work to be done. An earlier blog post discussed &lt;a href=&#34;https://blog.rstudio.com/2017/06/27/dbplyr-1-1-0/&#34;&gt;our vision&lt;/a&gt;. Part of that vision was to create a website where you can find everything about databases and R in one place. To learn more, visit our site at &lt;a href=&#34;http://db.rstudio.com/best-practices/drivers&#34;&gt;db.rstudio.com&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-query-bank-data-in-an-oracle-database&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Example: Query bank data in an Oracle database&lt;/h3&gt;
&lt;p&gt;In this example, we will query bank data in an Oracle database. We connect to the database by using the &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;odbc&lt;/code&gt; packages. This specific connection requires a database driver and a data source name (DSN) that have both been configured by the system administrator. Your connection might use another method.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(DBI)
library(dplyr)
library(dbplyr)
library(odbc)
con &amp;lt;- dbConnect(odbc::odbc(), &amp;quot;Oracle DB&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;query-using-dbi&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1. Query using &lt;code&gt;DBI&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;You can query your data with &lt;code&gt;DBI&lt;/code&gt; by using the &lt;code&gt;dbGetQuery()&lt;/code&gt; function. Simply paste your SQL code into the R function as a quoted string. This method is sometimes referred to as &lt;em&gt;pass-through SQL code&lt;/em&gt;, and is probably the simplest way to query your data. Take care to escape your quotes as needed. For example, &lt;code&gt;&#39;yes&#39;&lt;/code&gt; is written as &lt;code&gt;\&#39;yes\&#39;&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dbGetQuery(con,&amp;#39;
  select &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;,
  sum(case when &amp;quot;term_deposit&amp;quot; = \&amp;#39;yes\&amp;#39; then 1.0 else 0.0 end) as subscribe,
  count(*) as total
  from &amp;quot;bank&amp;quot;
  group by &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;
&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;query-using-dplyr-syntax&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2. Query using dplyr syntax&lt;/h3&gt;
&lt;p&gt;You can write your code in &lt;code&gt;dplyr&lt;/code&gt; syntax, and &lt;code&gt;dplyr&lt;/code&gt; will translate your code into SQL. There are several benefits to writing queries in &lt;code&gt;dplyr&lt;/code&gt; syntax: you can keep the same consistent language both for R objects and database tables, no knowledge of SQL or the specific SQL variant is required, and you can take advantage of the fact that &lt;code&gt;dplyr&lt;/code&gt; uses &lt;a href=&#34;http://dbplyr.tidyverse.org/articles/dbplyr.html&#34;&gt;lazy evaluation&lt;/a&gt;. &lt;code&gt;dplyr&lt;/code&gt; syntax is easy to read, but you can always inspect the SQL translation with the &lt;code&gt;show_query()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;q1 &amp;lt;- tbl(con, &amp;quot;bank&amp;quot;) %&amp;gt;%
  group_by(month_idx, year, month) %&amp;gt;%
  summarise(
    subscribe = sum(ifelse(term_deposit == &amp;quot;yes&amp;quot;, 1, 0)),
    total = n())
show_query(q1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;SQL&amp;gt;
SELECT &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;, SUM(CASE WHEN (&amp;quot;term_deposit&amp;quot; = &amp;#39;yes&amp;#39;) THEN (1.0) ELSE (0.0) END) AS &amp;quot;subscribe&amp;quot;, COUNT(*) AS &amp;quot;total&amp;quot;
FROM (&amp;quot;bank&amp;quot;) 
GROUP BY &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;query-using-an-r-notebooks&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;3. Query using an R Notebook&lt;/h3&gt;
&lt;p&gt;Did you know that you can run SQL code in an &lt;a href=&#34;http://rmarkdown.rstudio.com/r_notebooks.html&#34;&gt;R Notebook&lt;/a&gt; code chunk? To use SQL, open an &lt;a href=&#34;http://rmarkdown.rstudio.com/r_notebooks.html&#34;&gt;R Notebook&lt;/a&gt; in the RStudio IDE under the &lt;strong&gt;File &amp;gt; New File&lt;/strong&gt; menu. Start a new code chunk with &lt;code&gt;{sql}&lt;/code&gt;, and specify your connection with the &lt;code&gt;connection=con&lt;/code&gt; code chunk option. If you want to send the query output to an R dataframe, use &lt;code&gt;output.var = &amp;quot;mydataframe&amp;quot;&lt;/code&gt; in the code chunk options. When you specify &lt;code&gt;output.var&lt;/code&gt;, you will be able to use the output in subsequent R code chunks. In this example, we use the output in &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```{sql, connection=con, output.var = &amp;quot;mydataframe&amp;quot;}
SELECT &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;, SUM(CASE WHEN (&amp;quot;term_deposit&amp;quot; = &amp;#39;yes&amp;#39;) THEN (1.0) ELSE (0.0) END) AS &amp;quot;subscribe&amp;quot;,
COUNT(*) AS &amp;quot;total&amp;quot;
FROM (&amp;quot;bank&amp;quot;) 
GROUP BY &amp;quot;month_idx&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;month&amp;quot;
```&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```{r}
library(ggplot2)
ggplot(mydataframe, aes(total, subscribe, color = year)) +
  geom_point() +
  xlab(&amp;quot;Total contacts&amp;quot;) +
  ylab(&amp;quot;Term Deposit Subscriptions&amp;quot;) +
  ggtitle(&amp;quot;Contact volume&amp;quot;)
```&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-10-18-database-queries-with-R/bankggplot.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;One benefit of using SQL in a code chunk is that you can paste your SQL code without any modification. For example, you do not have to escape quotes. If you are working with the proverbial &lt;em&gt;spaghetti code&lt;/em&gt; that is hundreds of lines long, then a SQL code chunk might be a good option. Another benefit is that the SQL code in a code chunk is syntax highlighted, making it much easier to read. For more information on SQL engines, see this page on &lt;a href=&#34;http://rmarkdown.rstudio.com/authoring_knitr_engines.html&#34;&gt;knitr language engines&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;There is no single best way to query data with R. You have many methods to choose from, and each has its advantages. Here are some of the advantages of the methods described in this article.&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;34%&#34; /&gt;
&lt;col width=&#34;65%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Advantages&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;DBI::dbGetQuery&lt;/li&gt;
&lt;/ol&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Fewer dependencies required&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;dplyr syntax&lt;/li&gt;
&lt;/ol&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Use the same syntax for R and database objects&lt;/li&gt;
&lt;li&gt;No knowledge of SQL required&lt;/li&gt;
&lt;li&gt;Code is standard across SQL variants&lt;/li&gt;
&lt;li&gt;Lazy evaluation&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;R Notebook SQL engine&lt;/li&gt;
&lt;/ol&gt;&lt;/td&gt;
&lt;td&gt;&lt;ul&gt;
&lt;li&gt;Copy and paste SQL – no formatting required&lt;/li&gt;
&lt;li&gt;SQL syntax is highlighted&lt;/li&gt;
&lt;/ul&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;You can download the R Notebook for these examples &lt;a href=&#34;http://rpubs.com/nwstephens/318586&#34;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/10/18/database-queries-with-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Enterprise-ready dashboards with Shiny and databases</title>
      <link>https://rviews.rstudio.com/2017/09/20/dashboards-with-r-and-databases/</link>
      <pubDate>Wed, 20 Sep 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/09/20/dashboards-with-r-and-databases/</guid>
      <description>
        


&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-09-12-dashboards-with-r-and-databases/main.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Inside the enterprise, a dashboard is expected to have up-to-the-minute information, to have a fast response time despite the large amount of data that supports it, and to be available on any device. An end user may expect that clicking on a bar or column inside a plot will result in either a more detailed report, or a list of the actual records that make up that number. This article will cover how to use a set of R packages, along with Shiny, to meet those requirements.&lt;/p&gt;
&lt;div id=&#34;the-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The code&lt;/h2&gt;
&lt;p&gt;A working example for the dashboard pictured above is available here: &lt;a href=&#34;https://edgarruiz.shinyapps.io/flights-dashboard/&#34;&gt;Flights Dashboard&lt;/a&gt;. The example has all of the functionality that is discussed in this article, except the database connectivity. The code for the dashboard is available in this Gist: &lt;a href=&#34;https://gist.github.com/edgararuiz/89e771b5d1b82adaa0033c0928d1846d&#34;&gt;app.R&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The code for the dashboard that actually connects to a database is available in this Gist: &lt;a href=&#34;https://gist.github.com/edgararuiz/876ba4718e56af66c3e1181482b6cb99&#34;&gt;app.R&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;begin-with-shinydashboard&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Begin with &lt;code&gt;shinydashboard&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://rstudio.github.io/shinydashboard/&#34;&gt;shinydashboard&lt;/a&gt; package has three important advantages:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provides an out-of-the-box framework to create dashboards in Shiny.&lt;/strong&gt; This saves a lot of time, because the developer does not have to create the dashboard features manually using “base” Shiny.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Has a dashboard-friendly tag structure.&lt;/strong&gt; This allows the developer to get started quickly. Inside the &lt;code&gt;dashboardPage()&lt;/code&gt; tag, the &lt;code&gt;dashboardHeader()&lt;/code&gt;, &lt;code&gt;dashboardSidebar()&lt;/code&gt; and &lt;code&gt;dashboardBody()&lt;/code&gt; can be added to easily lay out a new dashboard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It is mobile-ready.&lt;/strong&gt; Without any additional code, the dashboard layout will adapt to a smaller screen automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;quick-example&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Quick example&lt;/h3&gt;
&lt;p&gt;If you are new to &lt;code&gt;shinydashboard&lt;/code&gt;, please feel free to copy and paste the following code to see a very simple dashboard in your environment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(shinydashboard)
library(shiny)
ui &amp;lt;- dashboardPage(
  dashboardHeader(title = &amp;quot;Quick Example&amp;quot;),
  dashboardSidebar(textInput(&amp;quot;text&amp;quot;, &amp;quot;Text&amp;quot;)),
  dashboardBody(
    valueBox(100, &amp;quot;Basic example&amp;quot;),
    tableOutput(&amp;quot;mtcars&amp;quot;)
  )
)
server &amp;lt;- function(input, output) {
  output$mtcars &amp;lt;- renderTable(head(mtcars))
}
shinyApp(ui, server)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deploy-using-config&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Deploy using &lt;code&gt;config&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;It is very common that credentials used during development will not be the same ones used for publishing. For databases, the best way to accommodate this is to have a Data Source Name (DSN) with the same alias name set up in both environments. If it is not possible to set up DSNs, then the &lt;code&gt;config&lt;/code&gt; package can be used to make the switch between credentials used in the different environments invisible. The &lt;a href=&#34;http://docs.rstudio.com/connect/admin/process-management.html#using-the-config-package&#34;&gt;RStudio Connect&lt;/a&gt; product supports the use of the &lt;code&gt;config&lt;/code&gt; package out of the box. Another advantage of using &lt;code&gt;config&lt;/code&gt;, in lieu of Kerberos or DSN, is that the credentials used will not appear in the plain text of the R code. A more detailed write-up is available in the &lt;a href=&#34;http://db.rstudio.com/best-practices/portable-code&#34;&gt;Make scripts portable&lt;/a&gt; article.&lt;/p&gt;
&lt;p&gt;This code snippet is an example YAML file that &lt;code&gt;config&lt;/code&gt; is able to read. It has one driver name for local development, and a different name for use during deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;default:
  mssql:
      Driver: &amp;quot;SQL Server&amp;quot;
      Server: &amp;quot;[server&amp;#39;s path]&amp;quot;
      Database: &amp;quot;[database name]&amp;quot;
      UID: &amp;quot;[user id]&amp;quot;
      PWD: &amp;quot;[password]&amp;quot;
      Port: 1433
rsconnect:
  mssql:
      Driver: &amp;quot;SQLServer&amp;quot;
      Server: &amp;quot;[server&amp;#39;s path]&amp;quot;
      Database: &amp;quot;[database name]&amp;quot;
      UID: &amp;quot;[user id]&amp;quot;
      PWD: &amp;quot;[password]&amp;quot;
      Port: 1433&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;default&lt;/code&gt; setting will be used automatically during development, and RStudio Connect will use the &lt;code&gt;rsconnect&lt;/code&gt; values when executing this code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dw &amp;lt;- config::get(&amp;quot;mssql&amp;quot;)
con &amp;lt;- DBI::dbConnect(odbc::odbc(),
                      Driver = dw$Driver,
                      Server = dw$Server,
                      UID    = dw$UID,
                      PWD    = dw$PWD,
                      Port   = dw$Port,
                      Database = dw$Database)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;populate-shiny-inputs-using-purrr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Populate Shiny inputs using &lt;code&gt;purrr&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;It is very common for Shiny inputs to retrieve their values from a table or a query. Because other queries in the dashboard will use the selected input to filter accordingly, the value that needs to be passed to those queries is normally an identification code, not the label displayed in the drop-down. To separate the keys from the values, the &lt;code&gt;map()&lt;/code&gt; function in the &lt;code&gt;purrr&lt;/code&gt; package can be used. In the example below, all of the records in the airlines table are collected and a list of names is created; &lt;code&gt;map()&lt;/code&gt; is then used to insert the carrier codes into each name node.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This code runs in ui
airline_list &amp;lt;- tbl(con, &amp;quot;airlines&amp;quot;) %&amp;gt;%
  collect  %&amp;gt;%
  split(.$name) %&amp;gt;%    # Place here the field that will be used for the labels
  map(~.$carrier)      # Place here the field that will be used for keys&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;selectInput()&lt;/code&gt; drop-down menu is able to read the resulting &lt;code&gt;airline_list&lt;/code&gt; list variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This code runs in ui
 selectInput(
    inputId = &amp;quot;airline&amp;quot;,
    label = &amp;quot;Airline:&amp;quot;, 
    choices = airline_list) # Use airline_list as the choices argument value&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;take-advantage-of-dplyrs-laziness&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Take advantage of &lt;code&gt;dplyr&lt;/code&gt;’s “laziness”&lt;/h2&gt;
&lt;p&gt;Dashboards normally have a common data theme, which is sourced with a common data set. A base query can be built because &lt;code&gt;dplyr&lt;/code&gt; translates into SQL under the covers and, due to “laziness”, doesn’t evaluate the query until something is requested from it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db_flights &amp;lt;- tbl(con, &amp;quot;flights&amp;quot;) %&amp;gt;%
  left_join(tbl(con, &amp;quot;airlines&amp;quot;), by = &amp;quot;carrier&amp;quot;) %&amp;gt;%
  rename(airline = name) %&amp;gt;%
  left_join(tbl(con, &amp;quot;airports&amp;quot;), by = c(&amp;quot;origin&amp;quot; = &amp;quot;faa&amp;quot;)) %&amp;gt;%
  rename(origin_name = name) %&amp;gt;%
  select(-lat, -lon, -alt, -tz, -dst) %&amp;gt;%
  left_join(tbl(con, &amp;quot;airports&amp;quot;), by = c(&amp;quot;dest&amp;quot; = &amp;quot;faa&amp;quot;)) %&amp;gt;%
  rename(dest_name = name) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;dplyr&lt;/code&gt; variable can then be used in more than one Shiny output. A second example is in the code used to build the &lt;code&gt;highcharter&lt;/code&gt; plot below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;output$total_flights &amp;lt;- renderValueBox({

  result &amp;lt;- db_flights %&amp;gt;%           # Use the db_flights variable
    filter(carrier == input$airline)
  if(input$month != 99) result &amp;lt;- filter(result, month == input$month)
  
  result &amp;lt;- result %&amp;gt;%
    tally %&amp;gt;%
    pull %&amp;gt;%                        # Use pull to get the total count as a vector
    as.integer()
  
  valueBox(value = prettyNum(result, big.mark = &amp;quot;,&amp;quot;),
           subtitle = &amp;quot;Number of Flights&amp;quot;)
})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;drill-down&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Drill down&lt;/h2&gt;
&lt;p&gt;The idea of a “drill-down” action is that the end user is able to see part or all of the data that makes up the aggregate result displayed in the dashboard. A “drill-down” action has two parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A dashboard element that displays a result is clicked. &lt;/strong&gt; The result is usually aggregate data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A new screen is displayed with another report.&lt;/strong&gt; The new report could show a lower-level aggregation, or it could list the rows that make up the result.&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;a-dashboard-element-is-clicked&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;A dashboard element is clicked&lt;/h3&gt;
&lt;p&gt;The following is one way to capture a click event. The idea is to display the top airport destinations for a given airline in a bar plot. When a bar is clicked, the desired result is for the plot to activate a drill-down. The &lt;code&gt;highcharter&lt;/code&gt; package will be used in this example.&lt;/p&gt;
&lt;p&gt;To capture a bar-click event in &lt;code&gt;highcharter&lt;/code&gt;, a small piece of JavaScript needs to be written. The example below could be used in most cases, so you can copy and paste it as-is into your code. The variable name and the input name (&lt;code&gt;bar_clicked&lt;/code&gt;) are the only two values that would have to be changed to match your chart.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt; js_bar_clicked &amp;lt;- JS(&amp;quot;function(event) {Shiny.onInputChange(&amp;#39;bar_clicked&amp;#39;, [event.point.category]);}&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command above creates a new JavaScript function inside R that makes it possible to track when a bar is clicked. Here is a breakdown of the code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;JS&lt;/strong&gt; - Indicates that the following function is JavaScript&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;function(event)&lt;/strong&gt; - Creates a new function that expects an &lt;code&gt;event&lt;/code&gt; variable. Highcharts passes the event when a bar is clicked, so &lt;code&gt;event&lt;/code&gt; will contain information about that bar.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shiny.onInputChange&lt;/strong&gt; - Is the function that JavaScript will use to interact with Shiny&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;bar_clicked&lt;/strong&gt; - Is the name of a new Shiny input; its value will be set to the argument that follows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;[event.point.category]&lt;/strong&gt; - Passes the &lt;strong&gt;category&lt;/strong&gt; value of the &lt;strong&gt;point&lt;/strong&gt; where the click was made&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next section will illustrate how to capture the change of the new &lt;code&gt;input$bar_clicked&lt;/code&gt;, and perform the second part of the “drill down”.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;renderHighchart()&lt;/code&gt; output function, the variable that contains the JavaScript is passed as part of a list of events: &lt;code&gt;events = list(click = js_bar_clicked)&lt;/code&gt;. Because the event is registered inside the &lt;code&gt;hc_add_series()&lt;/code&gt; call that creates the bar plot, the click event is tied to the bars themselves.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;output$top_airports &amp;lt;- renderHighchart({
  # Reuse the dplyr db_flights variable as the base query
  result &amp;lt;- db_flights %&amp;gt;%
    filter(carrier == input$airline) 
  if(input$month != 99) result &amp;lt;- filter(result, month == input$month) 
  result &amp;lt;- result %&amp;gt;%
    group_by(dest_name) %&amp;gt;%
    tally() %&amp;gt;%
    arrange(desc(n)) %&amp;gt;%                          
    collect %&amp;gt;%
    head(10)                                      
  highchart() %&amp;gt;%
    hc_add_series(
      data = result$n, 
      type = &amp;quot;bar&amp;quot;,
      name = paste(&amp;quot;No. of Flights&amp;quot;),
      events = list(click = js_bar_clicked)) %&amp;gt;%   # The JavaScript variable is called here
    hc_xAxis(
      categories = result$dest_name,               # Value in event.point.category
        tickmarkPlacement=&amp;quot;on&amp;quot;)})&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-appendtab-to-create-the-drill-down-report&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Using &lt;code&gt;appendTab()&lt;/code&gt; to create the drill-down report&lt;/h3&gt;
&lt;p&gt;The plan is to display a new drill-down report every time the end user clicks on a bar. To prevent pulling the same data unnecessarily, the code will be smart enough to simply switch the focus to an existing tab if the same bar has been clicked on before.&lt;/p&gt;
&lt;p&gt;The new, and really cool, &lt;code&gt;appendTab()&lt;/code&gt; function is used to dynamically create a new Shiny tab with a &lt;strong&gt;DataTable&lt;/strong&gt; that contains the first 100 rows of the selection. A simple vector, called &lt;code&gt;tab_list&lt;/code&gt;, is used to track all existing detail tabs. The &lt;code&gt;updateTabsetPanel()&lt;/code&gt; function is used to switch to the newly or previously created tab.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;observeEvent()&lt;/code&gt; function is the one that “catches” the event executed by the JavaScript, because it monitors the &lt;code&gt;bar_clicked&lt;/code&gt; Shiny input. Comments are added to the code below to cover more aspects of how to use these features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tab_list &amp;lt;- NULL

observeEvent(input$bar_clicked,{  
       airport &amp;lt;- input$bar_clicked[1]              # Selects the first value sent in [event.point.category]
       tab_title &amp;lt;- paste(input$airline,            # tab_title is the tab&amp;#39;s name and unique identifier
                          &amp;quot;-&amp;quot;, airport ,            
                          if(input$month != 99)     
                            paste(&amp;quot;-&amp;quot; , month.name[as.integer(input$month)]))
       
       if(tab_title %in% tab_list == FALSE){        # Checks to see if the title already exists
         details &amp;lt;- db_flights %&amp;gt;%                  # Reuses the db_flights dbplyr variable
           filter(dest_name == airport,             # Uses the [event.point.category] value for the filter
                  carrier == input$airline)         # Matches the current airline filter
         
         if(input$month != 99)                      # Matches the current month selection
            details &amp;lt;- filter(details, month == input$month) 
         details &amp;lt;- details %&amp;gt;%
           head(100) %&amp;gt;%                            # Select only the first 100 records
           collect()                                # Brings the 100 records into the R environment 
           
         appendTab(inputId = &amp;quot;tabs&amp;quot;,                # Starts a new Shiny tab inside the tabsetPanel named &amp;quot;tabs&amp;quot;
                   tabPanel(
                     tab_title,                     # Sets the name &amp;amp; ID
                     DT::renderDataTable(details)   # Renders the DataTable with the 100 newly collected rows
                   ))
         tab_list &amp;lt;&amp;lt;- c(tab_list, tab_title)        # Adds the new tab to the list, important to use &amp;lt;&amp;lt;- 
         }
         
       # Switches over to a panel that matched the name in tab_title.  
       # Notice that this function sits outside the if statement because
       # it still needs to run to select a previously created tab
       updateTabsetPanel(session, &amp;quot;tabs&amp;quot;, selected = tab_title)  
     })&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;remove-all-tabs-using-removetab-and-purrr&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Remove all tabs using &lt;code&gt;removeTab()&lt;/code&gt; and &lt;code&gt;purrr&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Creating new tabs dynamically can clutter the dashboard. So a simple &lt;code&gt;actionLink()&lt;/code&gt; button can be added to the &lt;code&gt;dashboardSidebar()&lt;/code&gt; in order to remove all tabs except the main dashboard tab.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This code runs in ui
  dashboardSidebar(
       actionLink(&amp;quot;remove&amp;quot;, &amp;quot;Remove detail tabs&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;observeEvent()&lt;/code&gt; function is used once more to catch when the link is clicked. The &lt;code&gt;walk()&lt;/code&gt; command from &lt;code&gt;purrr&lt;/code&gt; is then used to iterate through each tab title in the &lt;code&gt;tab_list&lt;/code&gt; vector, executing the Shiny &lt;code&gt;removeTab()&lt;/code&gt; command for each name. After that, the tab list variable is reset. Because of environment scoping, make sure to use the superassignment operator (&lt;code&gt;&amp;lt;&amp;lt;-&lt;/code&gt;) when resetting the variable, so that R updates the &lt;code&gt;tab_list&lt;/code&gt; defined outside of the &lt;code&gt;observeEvent()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This code runs in server
  observeEvent(input$remove,{
    # Use purrr&amp;#39;s walk command to cycle through each
    # panel tabs and remove them
    tab_list %&amp;gt;%
      walk(~removeTab(&amp;quot;tabs&amp;quot;, .x))
    tab_list &amp;lt;&amp;lt;- NULL
  })
  &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This example uses &lt;code&gt;shinydashboard&lt;/code&gt; to create enterprise dashboards, but there are other technologies as well. &lt;code&gt;flexdashboard&lt;/code&gt; is a great way to build similar enterprise dashboards in R Markdown. We used SQL Server to populate this dashboard, but you can use any database. For more information on using databases with R, see &lt;a href=&#34;http://db.rstudio.com/&#34; class=&#34;uri&#34;&gt;http://db.rstudio.com/&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/09/20/dashboards-with-r-and-databases/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Visualizations with R and Databases</title>
      <link>https://rviews.rstudio.com/2017/08/16/visualizations-with-r-and-databases/</link>
      <pubDate>Wed, 16 Aug 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/08/16/visualizations-with-r-and-databases/</guid>
      <description>
        


&lt;div id=&#34;the-challenge&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Challenge&lt;/h2&gt;
&lt;p&gt;Visualizations are one of R’s strengths. There are many functions and packages that create complex plots, often with one simple command. These plotting functions do two things: first, they take the raw data and run the calculations needed for a given visualization, and second, they draw the plot. If the source of the data resides within a database, the usual approach is to import all of the data and then create the plot. This is a problem, especially if the data is large.&lt;/p&gt;
&lt;p&gt;A strategy to address this problem is found in the new &lt;a href=&#34;http://db.rstudio.com/&#34;&gt;Database with RStudio&lt;/a&gt; website. The &lt;a href=&#34;http://db.rstudio.com/visualization/&#34;&gt;Creating Visualizations&lt;/a&gt; page outlines a solution that introduces the &lt;em&gt;“Transform in Database, plot in R”&lt;/em&gt; concept, and demonstrates its practical implementation. The article focused on knowledge sharing, rather than on providing a tool.&lt;/p&gt;
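&lt;p&gt;As a minimal sketch of the pattern (assuming a hypothetical &lt;code&gt;con&lt;/code&gt; connection and a &lt;code&gt;flights&lt;/code&gt; table), the aggregation runs inside the database and only the summarized rows come back to R for plotting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(ggplot2)

# Transform in database: the count per carrier is computed by the database engine
per_carrier &amp;lt;- tbl(con, &amp;quot;flights&amp;quot;) %&amp;gt;%
  count(carrier) %&amp;gt;%
  collect()      # Only the small summary table is pulled into R

# Plot in R: ggplot2 draws from the pre-aggregated data
ggplot(per_carrier, aes(carrier, n)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw table never leaves the database; only one row per carrier crosses the wire, which is what makes this approach practical for large data.&lt;/p&gt;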
&lt;/div&gt;
&lt;div id=&#34;introducing-dbplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introducing &lt;code&gt;dbplot&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The new &lt;code&gt;dbplot&lt;/code&gt; package collects multiple functions for in-database visualizations. It implements the principles laid out in the &lt;a href=&#34;http://db.rstudio.com/visualization/&#34;&gt;Creating Visualizations&lt;/a&gt; page, and it provides three types of functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Helper functions that return a &lt;code&gt;ggplot2&lt;/code&gt; visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Helper functions that return the results of the plot’s calculations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;db_bin()&lt;/code&gt; function introduced in the &lt;strong&gt;Creating Visualizations&lt;/strong&gt; page&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The package provides calculations or “base” &lt;code&gt;ggplot2&lt;/code&gt; visualizations for the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bar plot&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Line plot&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Histogram&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Raster&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;installation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;Install &lt;code&gt;dbplot&lt;/code&gt; from GitHub using the &lt;code&gt;devtools&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::install_github(&amp;quot;edgararuiz/dbplot&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;This example will use a Microsoft SQL Server database connection to provide a quick glance of how the package works. For more examples, please visit the &lt;a href=&#34;https://github.com/edgararuiz/dbplot&#34;&gt;package’s GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;dbplot-functions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;strong&gt;dbplot&lt;/strong&gt; functions&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;dbplot_histogram()&lt;/code&gt; function creates a 30-bin histogram by default. Because it uses &lt;code&gt;dplyr&lt;/code&gt; commands to perform the bin calculations, the function will work with any database that has &lt;code&gt;dplyr&lt;/code&gt; support, including &lt;code&gt;sparklyr&lt;/code&gt;. The only caveat is that the database must support basic functions such as &lt;code&gt;max()&lt;/code&gt; and &lt;code&gt;min()&lt;/code&gt;, which a few database types lack.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(dbplot)

tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;% 
  dbplot_histogram(alt)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-08-14-database-visualize_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;500&#34; height=&#34;400&#34;&gt;&lt;/p&gt;
&lt;p&gt;Because &lt;code&gt;dbplot_histogram()&lt;/code&gt; returns a &lt;code&gt;ggplot&lt;/code&gt; object, the result can be refined further with additional layers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;% 
  dbplot_histogram(alt, binwidth = 700) + 
  labs(title = &amp;quot;Airports Altitude&amp;quot;) +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-08-14-database-visualize_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;500&#34; height=&#34;400&#34;&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;db_compute-functions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;strong&gt;db_compute&lt;/strong&gt; functions&lt;/h3&gt;
&lt;p&gt;If more control over the plot is needed, then the &lt;code&gt;db_compute_bins()&lt;/code&gt; function returns a data frame with the lowest value of each bin and the record count per bin:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;% 
  db_compute_bins(alt)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## # A tibble: 28 x 2
##       alt count
##     &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
##  1  -54.0   559
##  2  250.4   176
##  3  554.8   203
##  4  859.2   131
##  5 1163.6    82
##  6 1468.0    40
##  7 1772.4    20
##  8 2076.8    18
##  9 2381.2    16
## 10 2685.6    12
## # ... with 18 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The results of the compute command can then be piped into a plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;% 
  db_compute_bins(alt) %&amp;gt;%
  ggplot() +
  geom_col(aes(alt, count, fill = count))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-08-14-database-visualize_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;500&#34; height=&#34;400&#34;&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;db_bin&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;strong&gt;db_bin()&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;dbplot&lt;/code&gt; package includes the &lt;code&gt;db_bin()&lt;/code&gt; function, first introduced in the &lt;strong&gt;Creating Visualizations&lt;/strong&gt; page. For more information, please read the &lt;a href=&#34;http://db.rstudio.com/visualization/#histogram&#34;&gt;Histogram&lt;/a&gt; section.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db_bin(any_field)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## (((max(any_field) - min(any_field))/(30)) * ifelse((as.integer(floor(((any_field) - 
##     min(any_field))/((max(any_field) - min(any_field))/(30))))) == 
##     (30), (as.integer(floor(((any_field) - min(any_field))/((max(any_field) - 
##     min(any_field))/(30))))) - 1, (as.integer(floor(((any_field) - 
##     min(any_field))/((max(any_field) - min(any_field))/(30))))))) + 
##     min(any_field)&lt;/code&gt;&lt;/pre&gt;
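&lt;p&gt;As a sketch of how this expression can be used directly: because &lt;code&gt;db_bin()&lt;/code&gt; returns an unevaluated expression, it can be spliced into a &lt;code&gt;dplyr&lt;/code&gt; pipeline with &lt;code&gt;!!&lt;/code&gt; so the bins are computed inside the database (assuming the same &lt;code&gt;con&lt;/code&gt; connection and &lt;code&gt;airports&lt;/code&gt; table as above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Bin the alt column in-database and count records per bin
tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;%
  group_by(alt_bin = !! db_bin(alt)) %&amp;gt;%
  tally()&lt;/code&gt;&lt;/pre&gt;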
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;next-steps&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;More plots will be possible as &lt;code&gt;dplyr&lt;/code&gt;-to-SQL translations are fine-tuned and enhanced. The &lt;code&gt;dbplot&lt;/code&gt; package will be the place where new calculations and plots will be implemented.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/08/16/visualizations-with-r-and-databases/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Some Ideas for your Internal R Package</title>
      <link>https://rviews.rstudio.com/2017/07/19/supporting-corporate-r-user-groups/</link>
      <pubDate>Wed, 19 Jul 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/07/19/supporting-corporate-r-user-groups/</guid>
      <description>
        
&lt;p&gt;At RStudio, I have the pleasure of interacting with data science teams around the world. Many of these teams are led by R users stepping into the role of &lt;a href=&#34;https://rviews.rstudio.com/2017/06/21/analytics-administration-for-r/&#34;&gt;analytic admins&lt;/a&gt;. These users are responsible for supporting and growing the R user base in their organization and often lead internal R user groups.&lt;/p&gt;
&lt;p&gt;One of the most successful strategies to support a corporate R user group is the creation of an internal R package. This article outlines some common features and functions shared in internal packages. Creating an R package is easier than you might expect. A good place to start is this &lt;a href=&#34;https://www.rstudio.com/resources/webinars/rstudio-essentials-webinar-series-programming-part-3/&#34;&gt;webinar on package creation&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;logos-and-custom-css&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Logos and Custom CSS&lt;/h2&gt;
&lt;p&gt;Interestingly, one powerful way to increase the adoption of data science outputs - plots, reports, and even slides - is to stick to consistent branding. Having a common look and feel makes it easier for management to recognize the work of the data science team, especially as the team grows. Consistent branding also saves the R user time that would normally be spent picking fonts and color schemes.&lt;/p&gt;
&lt;p&gt;It is easy to include logos and custom CSS inside of an R package, and to write wrapper functions that copy the assets from the package to a user’s local working directory. For example, this wrapper function adds a logo from the &lt;code&gt;RStudioInternal&lt;/code&gt; package to the working directory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;getLogo &amp;lt;- function(copy_to = getwd()) {
  copy_to &amp;lt;- normalizePath(copy_to)
  file.copy(system.file(&amp;quot;logo.png&amp;quot;, package = &amp;quot;RStudioInternal&amp;quot;), copy_to)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once available, logos and CSS can be added to &lt;a href=&#34;https://shiny.rstudio.com/articles/css.html&#34;&gt;Shiny apps&lt;/a&gt; and &lt;a href=&#34;http://rmarkdown.rstudio.com/html_document_format.html#custom_css&#34;&gt;R Markdown documents&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ggplot2-themes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;ggplot2 Themes&lt;/h2&gt;
&lt;p&gt;Similar to logos and custom CSS, many internal R packages include a custom ggplot2 theme. These themes ensure consistency across plots in an organization, making data science outputs easier to recognize and read.&lt;/p&gt;
&lt;p&gt;ggplot2 themes are shared as functions. To get started writing a ggplot2 theme, see the &lt;a href=&#34;http://ggplot2.tidyverse.org/reference/theme.html&#34;&gt;ggplot2 &lt;code&gt;theme()&lt;/code&gt; documentation&lt;/a&gt;. For inspiration, take a look at the &lt;a href=&#34;https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html&#34;&gt;ggthemes package&lt;/a&gt;.&lt;/p&gt;
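&lt;p&gt;As a minimal sketch, a shared theme is just a function that layers &lt;code&gt;theme()&lt;/code&gt; settings on top of a built-in theme; the settings below are placeholders for your organization’s branding:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)

# A minimal corporate theme built on top of theme_minimal()
theme_corporate &amp;lt;- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = &amp;quot;bold&amp;quot;),
      panel.grid.minor = element_blank(),
      legend.position = &amp;quot;bottom&amp;quot;
    )
}

# Usage: ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_corporate()&lt;/code&gt;&lt;/pre&gt;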
&lt;/div&gt;
&lt;div id=&#34;data-connections&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Connections&lt;/h2&gt;
&lt;p&gt;Internal R packages are also an effective way to share functions that make it easy for analysts to connect to internal data sources. Nothing is more frustrating for a first-time R user than trying to navigate the world of ODBC connections and complex database schemas before they can get started with data relevant to their day-to-day job.&lt;/p&gt;
&lt;p&gt;If you’re not sure where to begin, look through your own scripts for common database connection strings or configurations. &lt;a href=&#34;https://db.rstudio.com&#34;&gt;db.rstudio.com&lt;/a&gt; can provide more information on how to handle credentials, drivers, and config files.&lt;/p&gt;
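&lt;p&gt;As a sketch, a connection helper can wrap &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;odbc&lt;/code&gt; so analysts never see the connection string; the server, database, and credential names below are placeholders for your own environment:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;connectWarehouse &amp;lt;- function() {
  DBI::dbConnect(
    odbc::odbc(),
    Driver   = &amp;quot;SQL Server&amp;quot;,
    Server   = &amp;quot;warehouse.example.com&amp;quot;,  # placeholder host
    Database = &amp;quot;sales&amp;quot;,                  # placeholder database
    UID      = Sys.getenv(&amp;quot;WAREHOUSE_UID&amp;quot;),
    PWD      = Sys.getenv(&amp;quot;WAREHOUSE_PWD&amp;quot;)
  )
}&lt;/code&gt;&lt;/pre&gt;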
&lt;/div&gt;
&lt;div id=&#34;learnr-tutorials&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;learnr Tutorials&lt;/h2&gt;
&lt;p&gt;RStudio recently released &lt;a href=&#34;https://rstudio.github.io/learnr/&#34;&gt;learnr&lt;/a&gt;, a new package for creating interactive tutorials in R Markdown. There are many great resources online for getting started with R, but it can be useful to create tutorials specific to your internal data and domain. learnr tutorials can serve as training wheels for the other components of the internal R package, or teach broader concepts and standards accepted across the organization. For example, you might provide a primer that teaches new users your organization’s R style guide.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-07-17-corporate-r-user-groups/learnr_example.gif&#34; /&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sharing-an-internal-r-package&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sharing an Internal R Package&lt;/h2&gt;
&lt;p&gt;Internal packages can be built in the RStudio IDE and distributed as tar files. Alternatively, many organizations use RStudio Server or RStudio Server Pro to standardize the R environment in their organization. In addition to making it easy to share an internal package, a standard compute environment keeps new R users from having to spend time installing R, RStudio, and packages. While these are necessary skills, the first interactions with R should get new users to a data insight as fast as possible. RStudio Server Pro also includes IT functions for monitoring and restricting resources.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;wrap-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrap Up&lt;/h2&gt;
&lt;p&gt;If you are leading an R group, an internal R package is a powerful way to support your users and the adoption of R. Imagine how easy it would be to introduce R to co-workers if they could connect to real, internal data and create a useful, beautiful plot in under 10 minutes. Investing in an internal R package makes that onboarding experience possible.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/07/19/supporting-corporate-r-user-groups/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Analytics Administration for R</title>
      <link>https://rviews.rstudio.com/2017/06/21/analytics-administration-for-r/</link>
      <pubDate>Wed, 21 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/06/21/analytics-administration-for-r/</guid>
      <description>
        
&lt;p&gt;Analytic administrator is a role that data scientists assume when they onboard new tools, deploy solutions, support existing standards, or train other data scientists. It is a role that works closely with IT to maintain, upgrade, and scale analytic environments. Analytic admins have a multiplier effect - as they go about their work, they influence others in the organization to be more effective. If you are a data scientist using R, you might consider filling the role of analytic admin for your organization.&lt;/p&gt;
&lt;p&gt;Consider the data scientist who wants to make R a legitimate part of their organization. This person has to introduce a new technology and help IT build the architecture around it. In this role, the data scientist – acting as an analytic admin – influences their entire organization.&lt;/p&gt;
&lt;div id=&#34;the-need-for-analytic-admins&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The need for analytic admins&lt;/h1&gt;
&lt;p&gt;What organizations need analytic admins? Analytic admins are important for any organization that wants to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modernize their analytic tools&lt;/li&gt;
&lt;li&gt;Take advantage of all their data&lt;/li&gt;
&lt;li&gt;Build analytic products and applications&lt;/li&gt;
&lt;li&gt;Develop a best-in-class data science team&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although the need for analytic admins is pervasive in industry, companies rarely list it as a dedicated role. Instead, they require teamwork between data science and IT operations, or they may require data scientists to function as their own admins. But the need is real. Most organizations need help bridging the gap between data science and IT. If you see an opportunity to serve as an analytic admin, I suggest you take it.&lt;/p&gt;
&lt;p&gt;Analytic admins typically have to train themselves and carve out their own career. It is common for data scientists who operate as analytic admins to feel as though they are in no-man’s land. It is natural to feel lost between the worlds of data science and information technology. As someone who has been there, I can say the feeling is disorienting. However, I can also say the value of that position is tremendous. If you feel like you are operating in no-man’s land as you function in this role, just know you are exactly where you need to be.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-tooling-and-integration&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R tooling and integration&lt;/h1&gt;
&lt;p&gt;At RStudio, we think about doing data science as a development process that begins with accessing and understanding your data and ends with communicating your results. This process is thoroughly explained in the book &lt;a href=&#34;http://r4ds.had.co.nz/explore-intro.html&#34;&gt;R for Data Science&lt;/a&gt;, by Wickham and Grolemund.&lt;/p&gt;
&lt;p&gt;RStudio builds open-source and enterprise-ready products to help you do data science in R. These products include the RStudio IDE, RStudio Connect, and Shiny Server. These are designed to work with open-source R packages like Shiny, R Markdown, and the Tidyverse.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-06-21-analytics-administration-for-r_files/rstudio-toolchain.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;Most of the software that RStudio makes is open source, but enterprises often require additional professional features. Common professional features include security, authentication, high availability, administration, and load balancing.&lt;/p&gt;
&lt;p&gt;R is also used in production environments for hosting web applications, exposing APIs, and automating workflows. R is sometimes integrated into other systems such as data warehouses, Hadoop, and Spark. The role of the analytic admin is to provide tooling for data scientists, as well as to integrate R into production systems.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;linux-and-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Linux and R&lt;/h1&gt;
&lt;p&gt;RStudio products run on Linux, so understanding Linux will help you become self-sufficient, use R with other systems, and build better solutions. We will talk more about what you can do with Linux commands in an upcoming blog post.&lt;/p&gt;
&lt;p&gt;There are many resources for learning Linux online. Here is just &lt;a href=&#34;https://training.linuxfoundation.org/free-linux-training&#34;&gt;one offered by the Linux Foundation&lt;/a&gt;. Analytics admins need to know how to navigate (e.g., &lt;code&gt;cd&lt;/code&gt;, &lt;code&gt;pwd&lt;/code&gt;, &lt;code&gt;ls&lt;/code&gt;), install Linux packages (e.g., &lt;code&gt;apt-get install&lt;/code&gt;), and execute commands as root (e.g., &lt;code&gt;sudo&lt;/code&gt;). Also important are tab completion, keyboard shortcuts, and text editors (e.g., vim, nano).&lt;/p&gt;
&lt;p&gt;Did you know you can execute basic Linux commands from inside RStudio Server using the Tools &amp;gt; Shell option? You can also execute Linux commands inside the R console with the &lt;code&gt;system&lt;/code&gt; function.&lt;/p&gt;
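&lt;p&gt;For example, a quick sketch from the R console:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Run a Linux command and capture its output as a character vector
system(&amp;quot;whoami&amp;quot;, intern = TRUE)&lt;/code&gt;&lt;/pre&gt;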
&lt;p&gt;Another major benefit of learning Linux is the ability to administer production systems that run with Shiny Server, and the ability to deploy Shiny web applications into production.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;running-shiny-in-production&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Running Shiny in production&lt;/h1&gt;
&lt;p&gt;There is a growing trend in using Shiny web apps in production analytic workflows. The vibrant Shiny community now spans all verticals including pharmaceuticals, high technology, and finance. For many organizations, adopting Shiny is their first experience in running R in production.&lt;/p&gt;
&lt;p&gt;Production environments that depend on Shiny also need analytic admins who can deploy and support these applications. For example, some organizations now have complex Shiny applications that serve hundreds of end users over a cluster of load-balanced Shiny Servers. These applications often go through a standard development &amp;gt; test &amp;gt; production deployment process. New tools are being built for &lt;a href=&#34;https://github.com/rstudio/shinytest&#34;&gt;correctness testing&lt;/a&gt; and &lt;a href=&#34;https://github.com/rstudio/shinyloadtest&#34;&gt;load testing&lt;/a&gt; in Shiny. RStudio and other platform vendors are making significant investments in building architectures - like Shiny Server and RStudio Connect - that will help Shiny grow over the long term.&lt;/p&gt;
&lt;p&gt;The growth of Shiny creates an opportunity for analytic admins who want to make analytic content available to a wide audience. Shiny apps allow end users who know nothing about R to take advantage of the power of the R programming language. They have the potential to influence decision-makers who can take actions and see results based on the work data scientists share with them. There is an immediate need for analytic admins who understand Shiny and can help support environments that depend on it.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-started-installing-rstudio-server&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting started: Installing RStudio Server&lt;/h1&gt;
&lt;p&gt;A great way to get started learning analytics administration is to build your own open-source RStudio Server on Linux. Building an RStudio Server by hand is the analytic admin equivalent of a Jedi building their own lightsaber. It’s a core skill, so you should be able to do it yourself no matter what.&lt;/p&gt;
&lt;p&gt;An easy way to get started with RStudio Server is to set it up on Ubuntu with Amazon Web Services. AWS even has an instruction guide for &lt;a href=&#34;https://aws.amazon.com/blogs/big-data/running-r-on-aws/&#34;&gt;running R on AWS&lt;/a&gt;. The core commands of the install are the following four lines of code (note: this installs RStudio Server version 1.0.143).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ sudo apt-get install r-base
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
$ sudo gdebi rstudio-server-1.0.143-amd64.deb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, your installation is going to require more than just installing RStudio Server. You will probably want to use the CRAN repository, install Linux dependencies, add users, and manage R packages. Here is a complete script I used to set up RStudio Server on a simple AWS AMI (ami-efd0428f) using a T2-medium instance. I included instructions from &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2&#34;&gt;this document&lt;/a&gt; on how to install R from CRAN. I also opened port 8787 in my AWS security group so I could log into RStudio Server via my web browser.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;### Simple RStudio Server Install
### Based on AWS image: ami-efd0428f
 
## Install R from CRAN repository
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
$ sudo add-apt-repository &amp;#39;deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/&amp;#39;
$ sudo apt-get update
$ sudo apt-get -y install r-base
 
## Install RStudio Server version 1.0.143
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
$ sudo gdebi rstudio-server-1.0.143-amd64.deb
 
## Add a new user
$ sudo useradd -m myuser
$ sudo passwd myuser
 
## (Optional - may take time) Install common Linux dependencies
$ sudo apt-get -y install libcurl4-openssl-dev openssl libssl-dev
$ sudo apt-get -y install texlive texlive-latex-extra libxml2-dev
 
## (Optional - may take time) Install common R packages
$ sudo Rscript -e &amp;#39;install.packages(&amp;quot;shiny&amp;quot;, repos = &amp;quot;http://cran.rstudio.com/&amp;quot;)&amp;#39;
$ sudo Rscript -e &amp;#39;install.packages(&amp;quot;tidyverse&amp;quot;, repos = &amp;quot;http://cran.rstudio.com/&amp;quot;)&amp;#39;
 
## Point your browser to &amp;lt;AWS-instance-IP&amp;gt;:8787&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don’t want to install RStudio Server from scratch, there are other ways to get started. One is to use a community AMI like &lt;a href=&#34;http://www.louisaslett.com/RStudio_AMI/&#34;&gt;this one&lt;/a&gt;. Another is to use the &lt;a href=&#34;https://aws.amazon.com/marketplace/pp/B06W2G9PRY?qid=1497719355342&amp;amp;sr=0-1&amp;amp;ref_=srh_res_product_title&#34;&gt;AWS Marketplace&lt;/a&gt; to install RStudio Server Pro with 1-Click Launch.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Installation is just the first step to administering R. You should also consider the topics of authentication, security, scale, integration, hardware sizing, and configuration. Systems administrators have to do a lot of their own training, and analytic admins are no different. Fortunately, there are plenty of references to help you get started. Here are a few useful references for learning analytic administration for R, RStudio, and Shiny.&lt;/p&gt;
&lt;div id=&#34;rstudio-products&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;RStudio Products&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://docs.rstudio.com/&#34;&gt;RStudio documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us&#34;&gt;RStudio Support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/resources/webinars/administration-of-rstudio-connect-in-production/&#34;&gt;Administering RStudio Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/resources/webinars/administering-shiny-server-pro/&#34;&gt;Administering Shiny Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/resources/webinars/administration-of-rstudio-connect-in-production/&#34;&gt;Administration of RStudio Connect in Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;authentication-and-security&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Authentication and security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/226865027-Authentication-in-RStudio-Connect&#34;&gt;Authentication in RStudio Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/115000782547-Security-FAQ&#34;&gt;Security FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/221682007-Security-features-in-RStudio-Server-Pro&#34;&gt;Security features in RStudio Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/217801438-Can-I-load-balance-across-multiple-nodes-running-Shiny-Server-Pro-&#34;&gt;Can I load balance across multiple nodes running Shiny Server Pro?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;managing-r-packages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Managing R Packages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/215733837-Managing-libraries-for-RStudio-Server&#34;&gt;Managing libraries for RStudio Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/226871467-Package-management-in-RStudio-Connect&#34;&gt;Package management in RStudio Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/206827897-Secure-Package-Downloads-for-R&#34;&gt;Secure package downloads for R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/115006298728-Package-Management-for-Offline-RStudio-Connect-Installations&#34;&gt;Package Management for Offline RStudio Connect Installations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;shiny-server&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Shiny Server&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/221319028-How-do-I-deploy-Shiny-applications-to-Shiny-Server-&#34;&gt;How do I deploy Shiny applications to Shiny Server?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/220546267-Scaling-and-Performance-Tuning-Applications-in-Shiny-Server-Pro&#34;&gt;Scaling and Performance - Tuning Applications in Shiny Server Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/219482057-Shiny-Server-Pro-Authentication-Examples&#34;&gt;Shiny Server Pro: Authentication examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/products/shiny/download-server/&#34;&gt;Shiny Server download&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;other-useful-links&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other useful links&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.rstudio.com/products/rstudio/download-server/&#34;&gt;RStudio Server Download&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/236226087-Scaling-R-and-RStudio&#34;&gt;Scaling R and RStudio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/115002344588-Configuration-and-sizing-recommendations&#34;&gt;Configuration and sizing recommendations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/06/21/analytics-administration-for-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Databases using R</title>
      <link>https://rviews.rstudio.com/2017/05/17/databases-using-r/</link>
      <pubDate>Wed, 17 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/05/17/databases-using-r/</guid>
      <description>
        


&lt;div id=&#34;current-state&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Current State&lt;/h1&gt;
&lt;p&gt;Using databases is unavoidable for those who analyze data as part of their jobs. As R developers, our first instinct may be to approach databases the same way we do regular files. We may attempt to read the data either all at once or as few times as possible. The aim is to reduce the number of times we go back to the data ‘well’, so our queries extract as much data as possible. After that, we spend cycles analyzing the data in memory. Here is what this model looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-05-11-databases-using-r_files/today.png&#34;  height=&#34;400&#34; width=&#34;400&#34;&gt;&lt;/p&gt;
&lt;p&gt;Because the volume of data is significant with this approach, we usually attempt to come up with strategies to minimize the resources and time it takes to analyze the data. We may try to retrieve all rows of a few columns, or a few rows of several columns. Another tactic is to save the query results into individual files for later analysis.&lt;/p&gt;
&lt;p&gt;An improvement to the current approach would be to use the database’s SQL engine to perform as much of the data exploration as possible. An enterprise-grade SQL server will have more power, and will be better tuned, to execute transformations of large amounts of data. Our goal would then be to bring into R a more targeted data set that will be used for visualization and modeling.&lt;/p&gt;
&lt;p&gt;This improvement comes at a cost: we will need to know how to write SQL queries, and will have to switch between both languages. We may also end up using an external querying tool that provides a list of tables and inline SQL code helpers. Of course, this involves switching between tools. On a personal note, I used to switch from R to Microsoft SQL Server Management Studio. After that, I would bring the finalized query back into my R code.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;a-better-way&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;A better way&lt;/h1&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-05-11-databases-using-r_files/better.png&#34;  height=&#34;400&#34; width=&#34;400&#34;&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;dplyr&lt;/code&gt; package simplifies data transformation. It provides a consistent set of functions, called verbs, that can be used in succession and interchangeably to gain understanding of the data iteratively. The first time I re-wrote R code using &lt;code&gt;dplyr&lt;/code&gt;, the new script was at least half as long and much easier to understand.&lt;/p&gt;
&lt;p&gt;Another nice thing about &lt;code&gt;dplyr&lt;/code&gt; is that it can interact with databases directly. It accomplishes this by translating the &lt;code&gt;dplyr&lt;/code&gt; verbs into SQL queries. This incredibly convenient feature allows us to ‘speak’ directly with the database from R, thus resolving the issues brought up in the previous section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run data exploration over all of the data&lt;/strong&gt; - Instead of coming up with a plan to decide what data to import, we can focus on analyzing the data inside the database, which in turn should yield faster insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use the SQL Engine to run the data transformations&lt;/strong&gt; - We are, in effect, pushing the computation to the database because &lt;code&gt;dplyr&lt;/code&gt; is sending SQL queries to the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collect a targeted dataset&lt;/strong&gt; - After becoming familiar with the data and choosing the data points that will either be shared or modeled, a final query can then be used to bring only that data into memory in R.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;All your code is in R!&lt;/strong&gt; - Because we are using &lt;code&gt;dplyr&lt;/code&gt; to communicate with the database, there is no need to change language, or tools, to perform the data exploration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
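&lt;p&gt;A small sketch of the translation at work: adding &lt;code&gt;show_query()&lt;/code&gt; at the end of a pipeline prints the SQL that &lt;code&gt;dplyr&lt;/code&gt; would send to the database, which is a handy way to verify that the computation stays in the database (this assumes an open connection &lt;code&gt;con&lt;/code&gt; and an &lt;code&gt;airports&lt;/code&gt; table):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

tbl(con, &amp;quot;airports&amp;quot;) %&amp;gt;%
  filter(alt &amp;gt; 1000) %&amp;gt;%
  tally() %&amp;gt;%
  show_query()&lt;/code&gt;&lt;/pre&gt;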
&lt;/div&gt;
&lt;div id=&#34;example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example&lt;/h1&gt;
&lt;p&gt;There are three things that we will need to get started:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A database we can access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A database driver installed in either our workstation or RStudio Server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All of the required packages installed in R&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this section, we will demonstrate how to access a Microsoft SQL Server database from a workstation that is running on Microsoft Windows.&lt;/p&gt;
&lt;div id=&#34;database-driver&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Database Driver&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;database driver&lt;/strong&gt; is a program that allows the workstation and the database to communicate. In Microsoft Windows, the drivers that connect to MS SQL databases are installed by default. We need the &lt;strong&gt;name&lt;/strong&gt; of the driver to use inside our R code. The easiest way to find it is to open the ODBC Data Source Administrator. To locate it on your system, please refer to this article: &lt;a href=&#34;https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/check-the-odbc-sql-server-driver-version-windows&#34;&gt;Check the ODBC SQL Server Driver Version (Windows)&lt;/a&gt;. Once the administrator program is open, click on the &lt;strong&gt;Drivers&lt;/strong&gt; tab. On my laptop, these are the drivers available. I will use &lt;strong&gt;SQL Server&lt;/strong&gt; for the Driver argument in my connection in R.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-05-11-databases-using-r_files/odbc.png&#34;  height=&#34;400&#34; width=&#34;400&#34;&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R packages&lt;/h2&gt;
&lt;p&gt;Besides &lt;code&gt;dplyr&lt;/code&gt;, the following packages are required:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;odbc&lt;/code&gt; - This is the interface between the database driver and R&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DBI&lt;/code&gt; - Standardizes the functions related to database operations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dbplyr&lt;/code&gt; - Enables &lt;code&gt;dplyr&lt;/code&gt; to interact with databases. It also contains the vendor-specific SQL translations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The database accessibility feature is still being developed, so we will use the development versions of &lt;code&gt;dbplyr&lt;/code&gt; and &lt;code&gt;dplyr&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;devtools::install_github(&amp;quot;tidyverse/dplyr&amp;quot;)
devtools::install_github(&amp;quot;tidyverse/dbplyr&amp;quot;)
devtools::install_github(&amp;quot;rstats-db/odbc&amp;quot;)
install.packages(&amp;quot;DBI&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connect-to-the-database&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Connect to the database&lt;/h2&gt;
&lt;p&gt;We will use the &lt;code&gt;dbConnect()&lt;/code&gt; function from the &lt;code&gt;DBI&lt;/code&gt; package to connect to the database. The value for the &lt;code&gt;Driver&lt;/code&gt; argument is the name we determined in the &lt;em&gt;Database Driver&lt;/em&gt; section above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(DBI)

con &amp;lt;- dbConnect(odbc::odbc(),
                   Driver    = &amp;quot;SQL Server&amp;quot;, 
                   Server    = &amp;quot;localhost&amp;quot;,
                   Database  = &amp;quot;airontime&amp;quot;,
                   UID       = [My User ID],
                   PWD       = [My Password],
                   Port      = 1433)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A very useful function in &lt;code&gt;DBI&lt;/code&gt; is &lt;code&gt;dbListTables()&lt;/code&gt;, which retrieves the names of available tables.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dbListTables(con)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;[1] &amp;quot;airlines&amp;quot; &amp;quot;airport&amp;quot;  &amp;quot;airports&amp;quot; &amp;quot;faithful&amp;quot; &amp;quot;flights&amp;quot;  &amp;quot;iris&amp;quot;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Another useful function is &lt;code&gt;dbListFields()&lt;/code&gt;, which returns a vector with all of the column names in a table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dbListFields(con, &amp;quot;flights&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt; [1] &amp;quot;year&amp;quot;           &amp;quot;month&amp;quot;          &amp;quot;day&amp;quot;            &amp;quot;dep_time&amp;quot;       &amp;quot;sched_dep_time&amp;quot;
 [6] &amp;quot;dep_delay&amp;quot;      &amp;quot;arr_time&amp;quot;       &amp;quot;sched_arr_time&amp;quot; &amp;quot;arr_delay&amp;quot;      &amp;quot;carrier&amp;quot;
[11] &amp;quot;flight&amp;quot;         &amp;quot;tailnum&amp;quot;        &amp;quot;origin&amp;quot;         &amp;quot;dest&amp;quot;           &amp;quot;air_time&amp;quot;
[16] &amp;quot;distance&amp;quot;       &amp;quot;hour&amp;quot;           &amp;quot;minute&amp;quot;         &amp;quot;time_hour&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;interacting-with-the-data-using-dplyr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Interacting with the data using dplyr&lt;/h2&gt;
&lt;p&gt;Using &lt;code&gt;dplyr&lt;/code&gt;, we can easily preview a table in the database. The &lt;code&gt;tbl()&lt;/code&gt; command creates a reference to the table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;library(dplyr)
tbl(con, &amp;quot;flights&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Source:     table&amp;lt;flights&amp;gt; [?? x 19]
Database:   Microsoft SQL Server 12.00.4422[username@localhost/airontime]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;          &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt;          &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;   &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;
1   2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR   IAH
2   2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA   IAH
3   2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA
4   2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN
5   2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL
6   2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD
7   2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL
8   2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD
9   2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO
10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD
# ... with more rows, and 5 more variables: air_time &amp;lt;dbl&amp;gt;, distance &amp;lt;dbl&amp;gt;, hour &amp;lt;dbl&amp;gt;, minute &amp;lt;dbl&amp;gt;, time_hour &amp;lt;dttm&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt; The &lt;code&gt;tally()&lt;/code&gt; verb in &lt;code&gt;dplyr&lt;/code&gt; returns the row count.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tally(tbl(con, &amp;quot;flights&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Source:     lazy query [?? x 1]
Database:   Microsoft SQL Server 12.00.4422[username@localhost/airontime]

       n
   &amp;lt;int&amp;gt;
1 336776&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt; When used against a database, the previous function is converted to a SQL query that works with MS SQL Server. The &lt;code&gt;show_query()&lt;/code&gt; function displays the translation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show_query(tally(tbl(con, &amp;quot;flights&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;SQL&amp;gt; SELECT COUNT(*) AS &amp;quot;n&amp;quot; FROM &amp;quot;flights&amp;quot;&lt;/code&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bringing-it-all-together&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bringing it all together&lt;/h2&gt;
&lt;p&gt;The following code sample shows how easy it is to find the top airlines by number of flights. Additionally, we wish to see the names of the airlines and not their codes. The steps taken are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start with the &lt;code&gt;flights&lt;/code&gt; table and join it to the &lt;code&gt;airlines&lt;/code&gt; table to obtain the airline name&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Group the data by the airline &lt;code&gt;name&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tally the total rows by airline &lt;code&gt;name&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Order the data by the resulting tallies in a descending order&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these steps are translated into a SQL statement and processed inside the database. We do not need to import the tables into R memory at any time; we just use &lt;code&gt;dplyr&lt;/code&gt; to get the results quickly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tbl(con, &amp;quot;flights&amp;quot;) %&amp;gt;%
  left_join(tbl(con, &amp;quot;airlines&amp;quot;), by = &amp;quot;carrier&amp;quot;) %&amp;gt;%
  group_by(name) %&amp;gt;%
  tally %&amp;gt;%
  arrange(desc(n))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Source:     lazy query [?? x 2]
Database:   Microsoft SQL Server 12.00.4422[username@localhost/airontime]
Ordered by: desc(n)

# S3: tbl_dbi
                       name     n
                      &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
 1    United Air Lines Inc. 58665
 2          JetBlue Airways 54635
 3 ExpressJet Airlines Inc. 54173
 4     Delta Air Lines Inc. 48110
 5   American Airlines Inc. 32729
 6                Envoy Air 26397
 7          US Airways Inc. 20536
 8        Endeavor Air Inc. 18460
 9   Southwest Airlines Co. 12275
10           Virgin America  5162
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
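&lt;p&gt;When we are satisfied with the result, adding &lt;code&gt;collect()&lt;/code&gt; as a final step runs the same query and imports only this small summary table into R’s memory. A sketch of the same pipeline with that step added:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;top_airlines &amp;lt;- tbl(con, &amp;quot;flights&amp;quot;) %&amp;gt;%
  left_join(tbl(con, &amp;quot;airlines&amp;quot;), by = &amp;quot;carrier&amp;quot;) %&amp;gt;%
  group_by(name) %&amp;gt;%
  tally() %&amp;gt;%
  arrange(desc(n)) %&amp;gt;%
  collect()  # now a regular in-memory tibble, ready for modeling or plotting&lt;/code&gt;&lt;/pre&gt;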
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;additional-resources&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Additional Resources&lt;/h1&gt;
&lt;p&gt;Here are links that will provide a deeper look into their respective subjects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;http://dplyr.tidyverse.org/&#34;&gt;dplyr’s Official Site&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://cran.r-project.org/web/packages/DBI/vignettes/DBI-1.html&#34;&gt;Vignette of the DBI package&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;http://r4ds.had.co.nz/&#34;&gt;R for Data Science&lt;/a&gt; - An online book that covers how to use &lt;code&gt;dplyr&lt;/code&gt; and the other packages that together make up the &lt;code&gt;tidyverse&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;When we have only one method available to us, it is sometimes hard to see its inherent flaws. The method does what we need, so we do our best to overcome its shortfalls.&lt;/p&gt;
&lt;p&gt;Our hope is that highlighting the issues related to importing large amounts of data into R, and the advantages of using &lt;code&gt;dplyr&lt;/code&gt; to interact with databases, will be the encouragement needed to learn more about &lt;code&gt;dplyr&lt;/code&gt; and to give it a try.&lt;/p&gt;
&lt;p&gt;We plan to continue writing about the subject of databases using R in future posts. We will cover different aspects and techniques to get the most out of working with these two great technologies.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/05/17/databases-using-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>R for Enterprise: Understanding R’s Startup</title>
      <link>https://rviews.rstudio.com/2017/04/19/r-for-enterprise-understanding-r-s-startup/</link>
      <pubDate>Wed, 19 Apr 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/04/19/r-for-enterprise-understanding-r-s-startup/</guid>
      <description>
        
&lt;!-- BLOGDOWN-HEAD --&gt;
&lt;!-- /BLOGDOWN-HEAD --&gt;

&lt;!-- BLOGDOWN-BODY-BEFORE --&gt;
&lt;!-- /BLOGDOWN-BODY-BEFORE --&gt;
&lt;p&gt;R’s startup behavior is incredibly powerful. R sets environment variables, loads base packages, and understands whether you’re running a script, an interactive session, or even a build command.&lt;/p&gt;
&lt;p&gt;Most R users will never have to worry about changing R’s startup process. In fact, for portability and reproducibility of code, we recommend that users do not modify R’s startup profile. But, for system administrators, package developers, and R enthusiasts, customizing the launch process can provide a powerful tool and help avoid common gotchas. R’s behavior is thoroughly documented in &lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html&#34;&gt;R’s base documentation: “Initialization at Start of an R Session”&lt;/a&gt;. This post will elaborate on the official documentation and provide some examples. Read on if you’ve ever wondered how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tell R about a &lt;a href=&#34;https://rstudio.github.io/packrat/custom-repos.html&#34;&gt;local CRAN-like repository&lt;/a&gt; to host and share R packages internally&lt;/li&gt;
&lt;li&gt;Use a different version of Python, e.g., to support a &lt;a href=&#34;http://rstudio.github.io/tensorflow&#34;&gt;Tensorflow&lt;/a&gt; project&lt;/li&gt;
&lt;li&gt;Define a proxy so R can reach the internet in locked-down environments&lt;/li&gt;
&lt;li&gt;Understand why &lt;a href=&#34;https://rstudio.github.io/packrat/&#34;&gt;Packrat&lt;/a&gt; creates a .Rprofile&lt;/li&gt;
&lt;li&gt;Automatically run code at the end of a session to capture and log &lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll also discuss how RStudio starts R. Spoiler: it’s a bit different than you might expect!&lt;/p&gt;
&lt;div id=&#34;rprofile-.renviron-and-r.site-oh-my&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;.Rprofile, .Renviron, and R*.site oh my!&lt;/h2&gt;
&lt;p&gt;R’s startup process follows three steps: starting R, setting environment variables, and sourcing profile scripts. In the last two steps, R looks for site-wide files and user- or project-specific files. The R documentation explains this process in detail.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-04-04-rs-quirky-and-powerful-startup_files/R_STARTUP.jpeg&#34; /&gt;

&lt;/div&gt;
&lt;div id=&#34;common-gotchas-and-tricks&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Common Gotchas and Tricks:&lt;/h3&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;The Renviron file located at &lt;code&gt;R_HOME/etc&lt;/code&gt; is unique and different from Renviron.site and the user-specific .Renviron files. &lt;em&gt;Do not edit the Renviron file!&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A site-wide file, and either a project file &lt;em&gt;or&lt;/em&gt; a user file, can be loaded at the same time. It is not possible to use both a user file and a project file. If the project file exists, it will be used instead of the user file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The environment files are plain-text files in the form &lt;code&gt;name=value&lt;/code&gt;. The profile files contain R code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To double check what environment variables are defined in the R environment, run &lt;code&gt;Sys.getenv()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do not place things in a profile that limit the reproducibility or portability of your code. For example, setting &lt;code&gt;options(stringsAsFactors = FALSE)&lt;/code&gt; is discouraged because it will cause your code to break in mysterious ways in other environments. Other bad ideas include: reading in data, loading packages, and defining functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
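&lt;p&gt;To make the difference between the two file types concrete, here is a minimal sketch; the variable names, values, and options shown are only illustrative examples. An environment file holds plain &lt;code&gt;name=value&lt;/code&gt; pairs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# .Renviron -- plain text, one name=value pair per line (example values)
MY_API_KEY=abc123
R_LIBS_USER=~/R/library&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;while a profile file holds R code that is sourced at startup:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# .Rprofile -- R code, sourced at startup (example settings)
options(repos = c(CRAN = &amp;quot;https://cran.rstudio.com&amp;quot;))
message(&amp;quot;Library paths: &amp;quot;, paste(.libPaths(), collapse = &amp;quot;; &amp;quot;))&lt;/code&gt;&lt;/pre&gt;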
&lt;/div&gt;
&lt;div id=&#34;where-to-put-what&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Where to put what?&lt;/h3&gt;
&lt;p&gt;The R Startup process is very flexible, which means there are different ways to achieve the same results. For example, you may be wondering which environment variables to set in .Renviron versus Renviron.site. (Don’t even think about calling &lt;code&gt;Sys.setenv()&lt;/code&gt; in an Rprofile…)&lt;/p&gt;
&lt;p&gt;A simple rule of thumb is to answer the question: “When else do I want this variable to be set?”&lt;/p&gt;
&lt;p&gt;For example, if you’re on a shared server and you want the settings every time you run R, place .Renviron or .Rprofile in your home directory. If you’re a system admin and you want the settings to take effect for every user, modify Renviron.site or Rprofile.site.&lt;/p&gt;
&lt;p&gt;The best practice is to scope these settings as narrowly as possible. That means if you can place code in .Rprofile instead of Rprofile.site, you should! This practice complements the previous warnings about modifying R’s startup. The narrowest scope is to set up the environment within the code, not the profile.&lt;/p&gt;
&lt;div id=&#34;quiz&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Quiz&lt;/h4&gt;
&lt;p&gt;What is the best way to modify the path? The answer depends on the desired scope for the change.&lt;/p&gt;
&lt;p&gt;For example, in an R project using the &lt;a href=&#34;http://rstudio.github.io/tensorflow&#34;&gt;Tensorflow&lt;/a&gt; package, I might want R to use the version of Python installed in &lt;code&gt;/usr/local/bin&lt;/code&gt; instead of &lt;code&gt;/usr/bin&lt;/code&gt;. This change is best implemented by reordering the &lt;code&gt;PATH&lt;/code&gt; using &lt;code&gt;PATH=/usr/local/bin:${PATH}&lt;/code&gt;. This is a change I only want for this project, so I’d place the line in a .Renviron file in the project directory.&lt;/p&gt;
&lt;p&gt;On the other hand, I may want to add the JAVA SDK to the path so that any R session can use the &lt;code&gt;rJava&lt;/code&gt; package. To do so, I’d add a line like &lt;code&gt;PATH=${PATH}:/opt/jdk1.7.0_75/bin:/opt/jdk1.7.0_75/jre/bin&lt;/code&gt; to Renviron.site.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;r-startup-in-rstudio&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R Startup in RStudio&lt;/h2&gt;
&lt;p&gt;A common misconception is that R and RStudio are one and the same. RStudio runs on top of R and requires R to be installed separately. If you look at the process list while running RStudio, you’ll see at least two different processes: usually one called RStudio and one called rsession.&lt;/p&gt;
&lt;p&gt;RStudio starts R a bit differently than running R from the terminal. Technically, RStudio doesn’t “start” R, it uses R as a library, either as a DLL on Windows or as a shared object on Mac and Linux.&lt;/p&gt;
&lt;p&gt;The main difference is that the script wrapped around R’s binary is not run, and any customization to the script will not take effect. To see the script, try:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;cat $(which R)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For most people, this difference won’t be noticeable. Any settings in the startup files will still take effect. For users who build R from source, it is important to include the &lt;code&gt;--enable-R-shlib&lt;/code&gt; flag to ensure R also builds the shared libraries used by RStudio.&lt;/p&gt;
&lt;div id=&#34;r-startup-in-rstudio-server-pro&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;R Startup in RStudio Server Pro&lt;/h3&gt;
&lt;p&gt;RStudio Server Pro acts differently from R and the open-source version of RStudio. Prior to starting R, RStudio Server Pro uses PAM to create a session, and sources the &lt;code&gt;rsession-profile&lt;/code&gt;. In addition, RStudio Server Pro launches R from bash, which means settings defined in the user’s bash profile are available.&lt;/p&gt;
&lt;p&gt;In short, RStudio Server Pro provides more ways to customize the environment used by R. You might ask why you’d ever want &lt;em&gt;more&lt;/em&gt; options. Recall our rule of thumb: “When else do I want this variable to be set?”&lt;/p&gt;
&lt;p&gt;In server environments, there are often environment variables set every time a user interacts with the server. These environment variables are placed in a user’s bash profile by a system admin. Normally R wouldn’t pick up these settings. RStudio Server Pro allows R to make use of the work the system admin has already done by picking up these profiles.&lt;/p&gt;
&lt;p&gt;Likewise, there may be some actions that take place on the server when a user logs in that have to happen before R starts. For example, a Kerberos ticket used by the R session to access a data source must exist before R is started. RStudio Server Pro uses PAM sessions to enable these actions.&lt;/p&gt;
&lt;p&gt;There may also be actions or variables that should only be defined for RStudio, and not any other time R is run. To facilitate this use case, RStudio Server Pro provides the &lt;code&gt;rsession-profile&lt;/code&gt;. For example, if your environment makes use of RStudio Server Pro’s support for multiple versions of R, you’d place any environment variables that should be defined for all versions of R inside of &lt;code&gt;rsession-profile&lt;/code&gt;.&lt;/p&gt;
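&lt;p&gt;Since &lt;code&gt;rsession-profile&lt;/code&gt; is a shell script sourced before R launches, an entry might look like the following sketch (the path and variable are hypothetical examples, not a prescribed configuration):&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# rsession-profile -- sourced by RStudio Server Pro before R starts
# (example variable for illustration only)
export DATA_WAREHOUSE_HOST=warehouse.internal.example.com&lt;/code&gt;&lt;/pre&gt;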
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;examples&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Examples:&lt;/h2&gt;
&lt;div id=&#34;define-proxy-settings-in-renviron.site&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Define proxy settings in Renviron.site&lt;/h3&gt;
&lt;p&gt;Renviron.site is commonly used to tell R how to access the internet in environments with restricted network access. It is used so that the settings take effect for all R sessions and users. For example,&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;http_proxy=http://proxy.mycompany.com&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This article contains more details on how to &lt;a href=&#34;https://support.rstudio.com/hc/en-us/articles/200488488-Configuring-R-to-Use-an-HTTP-or-HTTPS-Proxy&#34;&gt;configure RStudio to use a proxy&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;add-a-local-cran-repository-for-all-users&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Add a local CRAN repository for all users&lt;/h3&gt;
&lt;p&gt;Organizations with offline environments often use local CRAN repositories instead of installing packages directly from a CRAN mirror. Local CRAN repositories are also useful for sharing internally developed R packages among colleagues.&lt;/p&gt;
&lt;p&gt;To use a local CRAN repository, it is necessary to add the repository to R’s list of repos. This setting is important for all sessions and users, so &lt;code&gt;Rprofile.site&lt;/code&gt; is used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;old_repos &amp;lt;- getOption(&amp;quot;repos&amp;quot;)
local_CRAN_URI &amp;lt;- paste0(&amp;quot;file://&amp;quot;, normalizePath(&amp;quot;path_to_local_CRAN_repo&amp;quot;))
options(repos = c(old_repos, my_repo = local_CRAN_URI))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More information on setting up a local CRAN repository is available &lt;a href=&#34;https://rstudio.github.io/packrat/custom-repos.html&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;record-sessioninfo-automatically&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Record sessionInfo automatically&lt;/h3&gt;
&lt;p&gt;Reproducibility is a critical part of any analysis done in R. One challenge for reproducible scripts and documents is tracking the version of R packages used during an analysis.&lt;/p&gt;
&lt;p&gt;The following code can be added to a .Rprofile file within an RStudio project to automatically log the &lt;code&gt;sessionInfo()&lt;/code&gt; after every RStudio session.&lt;/p&gt;
&lt;p&gt;This log could be referenced if an analysis needs to be run at a later date and fails due to a package discrepancy.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;.Last &amp;lt;- function(){
  if (interactive()) {
    
    ## check to see if we&amp;#39;re in an RStudio project (requires the rstudioapi package)
    if (!requireNamespace(&amp;quot;rstudioapi&amp;quot;))
      return(NULL)
    pth &amp;lt;- rstudioapi::getActiveProject()
    if (is.null(pth))
      return(NULL)
    
    ## append date + sessionInfo to a file called sessionInfoLog
    cat(&amp;quot;Recording session info into the project&amp;#39;s sessionInfoLog file...&amp;quot;)
    info &amp;lt;-  capture.output(sessionInfo())
    info &amp;lt;- paste(&amp;quot;\n----------------------------------------------&amp;quot;,
                  paste0(&amp;#39;Session Info for &amp;#39;, Sys.time()),
                  paste(info, collapse = &amp;quot;\n&amp;quot;),
                  sep  = &amp;quot;\n&amp;quot;)
    f &amp;lt;- file.path(pth, &amp;quot;sessionInfoLog&amp;quot;)
    cat(info, file = f, append = TRUE)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;automatically-turn-on-packrat&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Automatically turn on packrat&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;http://rstudio.github.io/packrat&#34;&gt;Packrat&lt;/a&gt; is an automated tool for package management and reproducible research. Packrat acts as a super-set of the previous example. When a user opts in to using packrat with an RStudio project, one of the things packrat automatically does is create (or modify) a project-specific .Rprofile. Packrat uses the .Rprofile to ensure that each time the project opens, Packrat mode is turned on.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;to-wrap-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;To Wrap Up&lt;/h2&gt;
&lt;p&gt;R’s startup behavior can be complex, sometimes quirky, but always powerful. At RStudio, we’ve worked hard to ensure that R starts and stops correctly whether you’re running RStudio Desktop, serving a Shiny app on shinyapps.io, rendering a report in RStudio Connect, or supporting hundreds of users and thousands of sessions in a load balanced configuration of RStudio Server Pro.&lt;/p&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/04/19/r-for-enterprise-understanding-r-s-startup/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>R Markdown for the Enterprise</title>
      <link>https://rviews.rstudio.com/2017/01/25/r-markdown-for-the-enterprise/</link>
      <pubDate>Wed, 25 Jan 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/01/25/r-markdown-for-the-enterprise/</guid>
      <description>
        

&lt;p&gt;In the corporate world, spreadsheets and PowerPoint presentations still dominate as the tools used for analyzing and sharing information. So, it is not at all surprising that even when business analysts use R for the analytical heavy lifting, they frequently revert to using spreadsheets and slide decks to share their results. This may seem like the easiest way to communicate with colleagues, but any modestly complicated project is likely to be error-prone and generate hours of unnecessary rework.&lt;/p&gt;

&lt;p&gt;An R-savvy analyst can harness R Markdown to develop reproducible business reporting and information sharing workflows in any business organization; all it takes is a little effort to master some basic R document preparation tools.&lt;/p&gt;

&lt;p&gt;In this post, I would like to examine a scenario that represents some experiences I had as an analytics professional.&lt;/p&gt;

&lt;h2 id=&#34;the-report-is-great-but-scenario&#34;&gt;“The report is great but…” Scenario&lt;/h2&gt;

&lt;p&gt;&lt;br/&gt;
&lt;img src=&#34;/post/2017-01-23-r-markdown-for-the-enterprise_files/new_analysis.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;A new R analysis is delivered in a PowerPoint presentation, and everyone thinks that the insights are very valuable. They all want more associates to see it, so almost immediately, the following three requests are made:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“&amp;hellip;we need it broken out by”&lt;/strong&gt; - The presentation needs to be split by a specific segment. The segment is normally geographical or managerial in nature.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;“&amp;hellip;they shouldn’t see each other’s data”&lt;/strong&gt; - Since the results are not published in a central publishing platform, it is necessary to create multiple versions of the same report in order to secure the contents.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;“&amp;hellip;we need it every”&lt;/strong&gt; - Satisfying requests 1 and 2 may not be too overwhelming if this were meant as a one-time analysis, but usually the analysis and its distribution need to be repeated on a regular interval.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because we exported the findings into a presentation, sharing the results becomes more complex and time-consuming if we wish to satisfy the new requirements.&lt;/p&gt;

&lt;h2 id=&#34;how-can-r-markdown-help&#34;&gt;How can R Markdown help?&lt;/h2&gt;

&lt;p&gt;&lt;img src=&#34;/post/2017-01-23-r-markdown-for-the-enterprise_files/rmarkdown.png&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;R Markdown combines the creation and sharing steps. The three requests can be satisfied using the following features of R Markdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Break out the reports&lt;/strong&gt; - Using R Markdown&amp;rsquo;s &lt;a href=&#34;http://rmarkdown.rstudio.com/developer_parameterized_reports.html&#34;&gt;Parameterized Reports&lt;/a&gt; feature, we can easily create documents for each required segment.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate the file creation&lt;/strong&gt; - R Markdown can be run from code, so a separate R script can iteratively run the R Markdown and pass a different parameter for each iteration.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create the slides inside R&lt;/strong&gt; - Take advantage of &lt;a href=&#34;http://rmarkdown.rstudio.com/ioslides_presentation_format.html&#34;&gt;R Markdown Presentation&lt;/a&gt; output to create a slide deck.  Without having to learn a new scripting language, we can code the slide deck and use the same Parameter feature to automate its creation.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep the interactivity&lt;/strong&gt; - In many cases, the end user needs a level of interactivity with the report. This interactivity can be achieved by using &lt;a href=&#34;http://www.htmlwidgets.org/&#34;&gt;htmlwidgets&lt;/a&gt; inside the R Markdown document.  For example, the &lt;a href=&#34;http://www.htmlwidgets.org/showcase_leaflet.html&#34;&gt;Leaflet&lt;/a&gt; widget can be used for interactive maps, the &lt;a href=&#34;http://www.htmlwidgets.org/showcase_datatables.html&#34;&gt;Data Table&lt;/a&gt; widget for interactive tables, and the &lt;a href=&#34;http://www.htmlwidgets.org/showcase_dygraphs.html&#34;&gt;dygraphs&lt;/a&gt; widget for interactive time series charting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
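&lt;p&gt;The first two points above can be sketched with a short driver script that loops over the segments and renders one document per segment via &lt;code&gt;rmarkdown::render()&lt;/code&gt;. The file name, parameter name, and segment values below are hypothetical, assuming the .Rmd declares a &lt;code&gt;region&lt;/code&gt; parameter in its YAML header:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(rmarkdown)

# Example segments -- one report will be created per value
regions &amp;lt;- c(&amp;quot;North&amp;quot;, &amp;quot;South&amp;quot;, &amp;quot;East&amp;quot;, &amp;quot;West&amp;quot;)

for (region in regions) {
  # Pass the segment as a parameter and write a separate output file
  render(&amp;quot;sales_report.Rmd&amp;quot;,
         params = list(region = region),
         output_file = paste0(&amp;quot;sales_report_&amp;quot;, region, &amp;quot;.html&amp;quot;))
}&lt;/code&gt;&lt;/pre&gt;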

&lt;h3 id=&#34;additional-benefits&#34;&gt;Additional benefits&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accessible and easy to open&lt;/strong&gt; - Any alternative tool needs to be as accessible as the current spreadsheet and presentation tool. R Markdown can output results in &lt;a href=&#34;http://rmarkdown.rstudio.com/formats.html&#34;&gt;HTML, PDF, and Word&lt;/a&gt;. Additionally, the Presentation output uses the highly accessible HTML5 format.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt; - Copying-and-pasting files, text, or images inevitably introduces human error. In R, data import, wrangling and modeling are already automated, so why not take it to its natural conclusion by using R Markdown to automate the presentation end of the process, as well?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating a dashboard is easy&lt;/strong&gt; - In a spreadsheet, this is normally accomplished with a combination of pivot tables and graphs. R Markdown uses &lt;a href=&#34;http://rmarkdown.rstudio.com/flexdashboard/&#34;&gt;flexdashboard&lt;/a&gt; to create visually striking, self-contained dashboards. Combined with htmlwidgets, this gives the audience access to a very powerful tool.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of a &lt;strong&gt;live&lt;/strong&gt; parameterized R Markdown flexdashboard based on stock data:
&lt;br&gt;
&lt;br&gt;
&lt;center&gt;&lt;embed src=&#34;http://colorado.rstudio.com:3939/content/239/parameterized-flexdashboard-stock.html&#34; width=&#34;800&#34; height=&#34;400&#34;&gt;&lt;/embed&gt;&lt;/center&gt;&lt;/p&gt;

&lt;h2 id=&#34;how-to-get-started&#34;&gt;How to get started&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;R Markdown is a free package&lt;/strong&gt;, so if you have R (and ideally RStudio), you can start using it today. Also, there are a lot of resources available for learning how to use R Markdown; the package’s &lt;a href=&#34;http://rmarkdown.rstudio.com/lesson-1.html&#34;&gt;official website&lt;/a&gt; is a good place to start.&lt;/p&gt;

&lt;p&gt;Here is a sample script that uses parameterized R Markdown to create a slide deck based on a selected stock. In this case, we used Google:&lt;/p&gt;

&lt;script src=&#34;https://gist.github.com/edgararuiz/0ad9a1cc3586b99d2ac57186d90e1aa7.js&#34;&gt;&lt;/script&gt;

&lt;p&gt;And here is the resulting deck. Press the right arrow key to advance to the next slide:
&lt;br&gt;
&lt;center&gt;&lt;embed src=&#34;http://colorado.rstudio.com:3939/content/250/Sample_Presentation.html&#34; width=&#34;800&#34; height=&#34;400&#34; frameborder=&#34;1&#34;&gt;&lt;/embed&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;This simple script creates a nice-looking, interactive deck that requires no manual intervention when the data needs to be refreshed, and only a single parameter change to select a different stock.&lt;/p&gt;

&lt;h2 id=&#34;final-thought&#34;&gt;Final thought&lt;/h2&gt;

&lt;p&gt;We encourage you to try R Markdown yourself. The “start small and then build big” strategy rarely fails, so you could begin by automating a simple report first, and then start taking advantage of more advanced features as you grow comfortable with the tool.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/01/25/r-markdown-for-the-enterprise/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>R for Enterprise: How to Scale Your Analytics Using R</title>
      <link>https://rviews.rstudio.com/2016/12/21/r-for-enterprise-how-to-scale-your-analytics-using-r/</link>
      <pubDate>Wed, 21 Dec 2016 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2016/12/21/r-for-enterprise-how-to-scale-your-analytics-using-r/</guid>
      <description>
        

&lt;p&gt;At RStudio, we work with many companies interested in scaling R. They typically want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How can R scale for big data or big computation?&lt;/li&gt;
&lt;li&gt;How can R scale for a growing team of data scientists?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post provides a framework for answering both questions.&lt;/p&gt;

&lt;h1 id=&#34;scaling-r-for-big-data-or-big-computation&#34;&gt;Scaling R for Big Data or Big Computation&lt;/h1&gt;

&lt;p&gt;The first step to scaling R is understanding what class of problems your organization faces. At RStudio, we think of three use cases: data extraction, embarrassingly parallel problems, and analysis on the whole. Garrett Grolemund hosted an excellent webinar on &lt;a href=&#34;https://www.rstudio.com/resources/webinars/working-with-big-data-in-r/&#34;&gt;Big Data in R&lt;/a&gt;, in which he outlined the differences in these three cases.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://www.rstudio.com/wp-content/uploads/2016/12/scalingR.001.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;DISCLAIMER: These three cases are not exhaustive, nor are most problems easily categorized into one of the three classes. But, when scoping a scaled R environment, it is imperative to understand which class needs to be enabled. Your organization might have all three cases, or it might have only one or two.&lt;/p&gt;

&lt;h2 id=&#34;case-1-compute-on-the-data-extract&#34;&gt;Case 1: Compute on the data extract&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Example: I want to build a predictive model. I only need a few dozen features and a three-month window to build a good model. I can also aggregate my data from the transaction level to the user level. The result is a much smaller data set that I can use to train my model in R.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Computing on data extracts is arguably the most common use case; an analyst will run a query to pull a subset of data from an external source into R. If your data extracts are large, you can run R on a server. At RStudio, we recommend using the server version of the IDE (either open-source or professional), but there are many ways to use R interactively on a server.&lt;/p&gt;

&lt;h2 id=&#34;case-2-compute-on-the-parts&#34;&gt;Case 2: Compute on the parts&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Example: When I worked at a national lab (NREL), we validated fuel economy models against real-world datasets. Each dataset had hundreds of recorded trips from individual vehicles. While the total dataset was TBs, each individual trip was a few hundred MBs. We ran independent models in parallel against each trip. Each of these jobs added a single line to a results file. Then we aggregated the results with a reduction step (taking a weighted mean). By using an HPC system, a task that would take weeks to run sequentially was completed in a few hours.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compute on the parts happens when the analyst needs to run the same analysis over many subsets of data, or needs to run the same analysis many times, and each model is independent of the others.&lt;/p&gt;

&lt;p&gt;Examples include cross-validation, sensitivity analysis, and model scoring. These problems are called &amp;ldquo;embarrassingly parallel&amp;rdquo; (often a misnomer, since scaling embarrassingly parallel problems is rarely embarrassingly simple).&lt;/p&gt;

&lt;h3 id=&#34;compute-on-the-parts-with-a-single-machine&#34;&gt;&lt;strong&gt;Compute on the parts with a single machine&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;By default, R is single-threaded; however, you can use R packages to do parallel processing on a multicore server or desktop. Local parallelization is facilitated by packages such as parallel, snow, and foreach, which run your R commands on independent worker processes across multiple cores. Alternatively, low-level parallelization can be achieved with packages like Rcpp and RcppParallel, which interface R with C++.&lt;/p&gt;
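
&lt;p&gt;As a minimal sketch of this pattern with the base &lt;code&gt;parallel&lt;/code&gt; package (the model and data below are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(parallel)

# One independent task per data subset.
fit_one &lt;- function(df) coef(lm(mpg ~ wt, data = df))

# Split the data into independent parts.
parts &lt;- split(mtcars, mtcars$cyl)

# Fork one worker per part (Unix-alikes); on Windows,
# use makeCluster() with parLapply() instead.
results &lt;- mclapply(parts, fit_one, mc.cores = 2)
&lt;/code&gt;&lt;/pre&gt;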

&lt;h3 id=&#34;compute-on-the-parts-with-a-high-performance-cluster-hpc&#34;&gt;&lt;strong&gt;Compute on the parts with a high performance cluster (HPC)&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;In some cases, R users have access to High Performance Computing environments. These environments are becoming more readily available with technologies like Docker Swarm. An R user will test R code interactively (on an edge node or their local machine), and then submit the R code to the cluster as a series of batch jobs. Each batch job will call R on a slave node.&lt;/p&gt;

&lt;p&gt;Note that RStudio, as an interactive IDE, may run on an edge node of the cluster or on a local machine. RStudio does not run on the slave nodes. Only R is run on the slave nodes and is executed in batch (not interactively).&lt;/p&gt;

&lt;p&gt;One challenge faced by R users is knowing how to submit batch jobs to the cluster, tracking their progress, and re-running jobs that fail. One solution is the batchtools package. This package abstracts the details of job submission and tracking into a series of R function calls. The R functions, in turn, use generic templates provided by system administrators. Parallel R with Batch Jobs provides a nice overview. Some analysts have created Shiny applications that leverage these functions to provide an interactive Job Management interface from within RStudio!&lt;/p&gt;
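
&lt;p&gt;A sketch of the batchtools workflow, assuming a scheduler template has already been configured by an administrator (the toy function below stands in for a real per-trip model):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(batchtools)

# The registry persists job state on shared storage.
reg &lt;- makeRegistry(file.dir = &#34;trip_registry&#34;)

# Define one job per input, then submit them to the cluster.
batchMap(fun = function(trip) mean(trip), trip = list(1:10, 21:30), reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

# Re-run only the jobs that failed, then collect the results.
submitJobs(findErrors(reg = reg), reg = reg)
results &lt;- reduceResultsList(reg = reg)
&lt;/code&gt;&lt;/pre&gt;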

&lt;p&gt;One challenge faced by system administrators is ensuring the dependencies for the batch R script are available on all the slave nodes. Dependencies include: data access, the correct version of R, and correct versions of R packages. One solution is to store the R binaries and package libraries on shared storage (accessible by every slave node), alongside shared data and the project&amp;rsquo;s read/write scratch space.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://www.rstudio.com/wp-content/uploads/2016/12/scalingR.003.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Case 2: Compute on the parts. Technologies: parallel, snow, RcppParallel, LSF, &lt;a href=&#34;https://slurm.schedmd.com/&#34;&gt;SLURM&lt;/a&gt;, &lt;a href=&#34;http://www.adaptivecomputing.com/products/open-source/torque/&#34;&gt;Torque&lt;/a&gt;, Docker Swarm&lt;/p&gt;

&lt;h2 id=&#34;case-3-compute-on-the-whole&#34;&gt;Case 3: Compute on the whole&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Example: A recommendation engine for movies that is robust to &amp;ldquo;unique&amp;rdquo; tastes. The entire domain space needs to be considered all at once. Image classification falls into this class; the weights for a complex neural network need to be fit against the entire training set. This class of problem is the most difficult to solve, and has generated the most hype. Sometimes analysts will purchase, use, and modify ready-made implementations of these algorithms.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Computing on the whole happens when the analyst needs to run a model against an entire dataset, and the model is not embarrassingly parallel or the data does not fit on a single machine. Typically, the analyst will leverage specialized tools such as MapReduce, SQL, Spark, H2O.ai, and others. R is used as an orchestration layer: orchestration means using R to direct jobs in other languages. R has a long history of orchestrating other languages to accomplish computationally intensive tasks. See &lt;a href=&#34;https://www.amazon.com/Extending-Chapman-Hall-John-Chambers-ebook/dp/B01GRHCLG0/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1481307605&amp;amp;sr=1-1&amp;amp;keywords=extending+R+john+chambers&#34;&gt;Extending R&lt;/a&gt; by John Chambers.&lt;/p&gt;

&lt;p&gt;When orchestrating a case 3 problem, the R analyst will use R to direct an external computation engine that does the heavy lifting. This approach is very similar to case 1. For example, Oracle&amp;rsquo;s Big Data Appliance and Microsoft SQL Server 2016 with R Server both include routines for fitting models in the database. These routines are accessible as specialized R functions, which are used in addition to case 1 extracts created with traditional SQL queries through RODBC or dplyr.&lt;/p&gt;

&lt;p&gt;Another example is Apache Spark. The R analyst will work from an edge node running R. (The open-source or professional RStudio Server can facilitate this interactive use.) In R, the user will call functions from a specialized R package, which in turn accesses Spark&amp;rsquo;s data processing and machine learning routines. One available R package is sparklyr.&lt;/p&gt;
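
&lt;p&gt;A minimal sparklyr sketch; the connection master and the data below are illustrative and depend on your cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(sparklyr)
library(dplyr)

sc &lt;- spark_connect(master = &#34;local&#34;)  # or e.g. &#34;yarn-client&#34; on a cluster

# The data lives in Spark; dplyr verbs are translated to Spark SQL.
mtcars_tbl &lt;- copy_to(sc, mtcars)

# The model is fit by Spark MLlib, not by R.
fit &lt;- mtcars_tbl %&gt;%
  filter(wt &gt; 2) %&gt;%
  ml_linear_regression(mpg ~ wt + cyl)

spark_disconnect(sc)
&lt;/code&gt;&lt;/pre&gt;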

&lt;p&gt;Note that the machine learning routines are not running in R. The analyst uses these routines as black boxes that can be pieced together into pipelines, but not modified directly.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://www.rstudio.com/wp-content/uploads/2016/12/scalingR.004.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Case 3: Compute on the whole. Technologies: Hadoop, Spark, Tensorflow, In-DB computing (RevoScaleR, OracleR, Aster, etc)&lt;/p&gt;

&lt;h1 id=&#34;multiple-users-scaling-r-for-teams&#34;&gt;Multiple Users: Scaling R for Teams&lt;/h1&gt;

&lt;p&gt;As organizations grow, another concern is how to scale R for a team of data scientists. This type of scale is orthogonal to the previous topic. Scaling for a team addresses questions like: How can analysts share their work? How can compute resources be shared? How does R integrate with the IT landscape? In many cases, these questions need to be answered even if the R environment doesn&amp;rsquo;t need to scale for big data.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://www.rstudio.com/wp-content/uploads/2016/12/scalingR.002.jpeg&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Scaling R for teams. Technologies: Version control (Git, SVN), miniCRAN, RStudio Server Pro&lt;/p&gt;

&lt;p&gt;Open-source packages can address many of these concerns. For example, many organizations use packrat and miniCRAN to manage R&amp;rsquo;s package ecosystem. The use of version control becomes increasingly important as teams grow and work together. Many companies will create internal R packages to facilitate sharing things like data access scripts, ggplot2 themes, and R Markdown templates. Airbnb provides a &lt;a href=&#34;https://medium.com/airbnb-engineering/using-r-packages-and-education-to-scale-data-science-at-airbnb-906faa58e12d#.ftpmn6tpn&#34;&gt;detailed example&lt;/a&gt;. For more information on version control, packrat, and packages, see the webinar series &lt;a href=&#34;https://www.rstudio.com/resources/webinars/rstudio-essentials-webinar-series-part-1/&#34;&gt;RStudio Essentials&lt;/a&gt;. At RStudio, we recommend using RStudio Server Pro because its features, such as load balancing, multi-session support, collaborative editing, and auditing, are designed specifically to support large numbers of user sessions.&lt;/p&gt;
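
&lt;p&gt;As one hedged sketch, miniCRAN can build an internal, CRAN-like repository on shared storage; the package list and path below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(miniCRAN)

# Resolve the full dependency tree for the packages the team needs.
pkgs &lt;- pkgDep(c(&#34;ggplot2&#34;, &#34;data.table&#34;), suggests = FALSE)

# Mirror them into a local repository.
makeRepo(pkgs, path = &#34;/shared/miniCRAN&#34;, type = &#34;source&#34;)

# Analysts then install from the internal repository:
# install.packages(&#34;ggplot2&#34;, repos = &#34;file:///shared/miniCRAN&#34;)
&lt;/code&gt;&lt;/pre&gt;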

&lt;h1 id=&#34;wrap-up&#34;&gt;Wrap Up&lt;/h1&gt;

&lt;p&gt;Whether you need to compute on big data, grow your analytics team, or both, R has tools to help you succeed. As more companies look to data to drive business decisions, creating a scalable R environment will be a critical step toward success. Many of the topics in this post deserve their own posts; however, understanding and discussing these different types of scale can help create the correct roadmap. If you&amp;rsquo;ve created an R environment at scale, we&amp;rsquo;d love to hear from you. In a later post, we&amp;rsquo;ll address another outstanding question: after I scale the R platform, how do I scale the distribution of results and insights to non-R users?&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2016/12/21/r-for-enterprise-how-to-scale-your-analytics-using-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Make R a Legitimate Part of Your Organization</title>
      <link>https://rviews.rstudio.com/2016/11/16/make-r-a-legitimate-part-of-your-organization/</link>
      <pubDate>Wed, 16 Nov 2016 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2016/11/16/make-r-a-legitimate-part-of-your-organization/</guid>
      <description>
        

&lt;h1 id=&#34;how-r-enters-through-the-back-door&#34;&gt;How R Enters Through the Back Door&lt;/h1&gt;

&lt;p&gt;In many organizations, R enters through the back door when analysts download the free software and install it on their local workstations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Jamie has been an avid R programmer since college. When she takes a new job at a large corporation, she finds that she is the only analyst in the company who knows and uses R. In addition to the other tools her company gives her, Jamie decides to download R onto her laptop. She installs R without consulting her manager or IT. With R she can pull data, build models, and create nice reports. Her manager knows nothing about R, but goes along with it because Jamie is happy and doing quality work. Her co-workers, ever curious about analytics, also download R and learn from Jamie. Before long, R becomes an important part of the day-to-day operations of her team. When Jamie starts hiring new analysts, she lists R as a required skill. Now Jamie wants to &amp;ldquo;go big&amp;rdquo; by putting R on the company servers so she can scale her analyses, socialize her results, and integrate her apps. Unfortunately, she finds she is unable to get the resources she needs because R is not officially recognized in the company.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whether you are an analyst wanting to do more, a stakeholder wanting a competitive analytic platform, or an IT professional wanting a controlled and secured environment, you should make R a legitimate part of your organization and get the resources needed to support it.&lt;/p&gt;

&lt;h1 id=&#34;bringing-r-through-the-front-door&#34;&gt;Bringing R Through the Front Door&lt;/h1&gt;

&lt;p&gt;All organizations have a process for onboarding software through official channels. If you are part of a large organization, your IT department probably has a review board whose purpose is to review and make decisions about new tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The last time I introduced R into an organization, I created a presentation explaining why R should be supported by IT. My proposal was presented to the IT review board. It included slides on cost savings, strategic advantages, hardware and software requirements, and more. It took a few iterations to get through all of the requirements, but in the end, the board approved R as an additional analytic standard, paving the way for future growth.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;iframe src=&#34;//www.slideshare.net/slideshow/embed_code/key/slRGUZHuzIA5ld&#34; width=&#34;100%&#34; height=&#34;485&#34; frameborder=&#34;0&#34; marginwidth=&#34;0&#34; marginheight=&#34;0&#34; scrolling=&#34;no&#34; style=&#34;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&#34; allowfullscreen=&#34;&#34;&gt;&lt;/iframe&gt;

&lt;p&gt;The review board is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reviewing new software initiatives and approving expenditures.&lt;/strong&gt; Does this tool increase or decrease costs? What line items will it go under? What is the long-term cost projected to be? What is the cost of support?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supporting the organization&amp;rsquo;s strategic vision.&lt;/strong&gt; Does the tool help satisfy a customer need? Does it help us remain competitive? Can it help us attract better talent? Does it make existing systems more efficient and agile?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complying with existing systems architectures.&lt;/strong&gt; Does the tool integrate with other supported tools? Will it be used in development and/or production? Does it duplicate the capabilities of other supported tools?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Managing risk and ensuring security.&lt;/strong&gt; Does the tool comply with our formal security policies? Do the software licenses meet our legal requirements?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Defining roles and responsibilities for support.&lt;/strong&gt; What groups own the tool? What support is offered with the tool? What internal resources will be required to maintain it? Who will provide training?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of R&amp;rsquo;s popularity and explosive growth, many organizations are friendly and even eager to bring R through the front door. If your organization is friendly toward R but has not made it an official part of the organization, a formal review process is still valuable. The review process gives IT a formal stake in the ground when it comes to supporting R for the long term. It also makes future decisions about growth and investment much easier.&lt;/p&gt;

&lt;h1 id=&#34;the-ubiquity-of-open-source-software&#34;&gt;The Ubiquity of Open Source Software&lt;/h1&gt;

&lt;p&gt;Here at RStudio, we work with customers every day who want to bring R through the front door. One complaint we sometimes hear is that IT does not want to support open-source software (OSS). The reality is that most organizations are already supporting OSS. The &lt;a href=&#34;https://www.blackducksoftware.com/2016-future-of-open-source&#34;&gt;2016 Future of Open Source survey&lt;/a&gt; estimated that &lt;a href=&#34;http://www.zdnet.com/article/its-an-open-source-world-78-percent-of-companies-run-open-source-software/&#34;&gt;78% of companies run part or all of their operations on OSS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most organizations know about R by now. IEEE Spectrum ranked R fifth in the &lt;a href=&#34;http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages&#34;&gt;top programming languages of 2016&lt;/a&gt;, making it one of the most commonly used analytic tools in industry today.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;https://www.rstudio.com/wp-content/uploads/2016/11/top_languages2016.png&#34; alt=&#34;Top Languages 2016&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Some organizations struggle to standardize on R due to a lack of management and governance around OSS. At the same time, organizations may neglect R on user workstations, thereby increasing security, legal, and operational risks. It is riskier to leave R unmanaged than it is to bring it through the front door.&lt;/p&gt;

&lt;h1 id=&#34;getting-the-resources-you-need&#34;&gt;Getting the Resources You Need&lt;/h1&gt;

&lt;p&gt;Passing the review board should secure the resources you need: both physical and human resources to build, scale, and maintain an R environment.&lt;/p&gt;

&lt;h2 id=&#34;physical-resources&#34;&gt;Physical Resources&lt;/h2&gt;

&lt;p&gt;Investing resources in R is a great way to legitimize it: an organization that allocates budget and people to R will also expect to see value from that investment. Some resources you might need include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A budget or line items in a budget&lt;/li&gt;
&lt;li&gt;Physical or virtual hardware&lt;/li&gt;
&lt;li&gt;Software tools and licenses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;human-resources&#34;&gt;Human Resources&lt;/h2&gt;

&lt;p&gt;The type of IT support you get will depend on how your IT organization is structured. You might have a single admin designated to support R, or you might have an entire support team. You will probably want to define the following roles and responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An R advocate who promotes R&lt;/li&gt;
&lt;li&gt;An executive sponsor who supports the R users&lt;/li&gt;
&lt;li&gt;A designated R admin or R support team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your IT support will manage your environment, so getting the right people and policies in place is critical. Generally speaking, having a point of contact in IT — a name and a face — is a good thing. Your admin support should be familiar with the Linux operating system. Training admins on R-related issues is also helpful.&lt;/p&gt;

&lt;h1 id=&#34;adopting-r&#34;&gt;Adopting R&lt;/h1&gt;

&lt;p&gt;After you bring R through the front door and it becomes part of the organization, you should have a vision and path for growth. You should also have resources to support that growth. So what are you going to do with your newfound resources?&lt;/p&gt;

&lt;p&gt;The next step is adoption. Adoption means R is self-sustaining: the goal is for your organization to fully embrace R as an integral part of your business. The survival of R should not depend on one or two R advocates any more than SQL depends on one or two DBAs. Instead, there should be systems, resources, and people in place that will sustain the growth of R.&lt;/p&gt;

&lt;p&gt;Making R a legitimate part of your organization and getting the resources you need to support it is the foundation for future growth and adoption.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2016/11/16/make-r-a-legitimate-part-of-your-organization/&#39;;&lt;/script&gt;
      </description>
    </item>
    
  </channel>
</rss>
