Automating Summary of Surveys with RMarkdown

Tue, 07 Nov 2017 00:00:00 +0000

This guide shows how to automate the summary of surveys with R and R Markdown using RStudio. This is great for portions of the document that don’t change (e.g., “the survey shows substantial partisan polarization”). The motivation is twofold: efficiency (maximize the reusability of code, minimize copying and pasting errors) and reproducibility (maximize the number of people and computers that can reproduce findings).

The basic setup is to write an Rmd file that will serve as a template, and then a short R script that loops over each data file (using library(rmarkdown), which relies on knitr to run the code). The render function then turns the Rmd into documents or slides (typically PDF, HTML, or docx), taking file metadata as a parameter.

There are countless ways to summarize a survey in R. This guide shows a few basics with ggplot and questionr, but focuses on the overall workflow (file management, etc.). Following the instructions here, you should be able to reproduce all four reports (and in principle, many more) despite only writing code to clean one survey. Most of the code is displayed in this document, but all code is found in either pewpoliticaltemplate.Rmd or pew_report_generator.R. All code, as well as the output documents, can be found here, and details on obtaining the data are found below.

Software

RStudio’s interface with library(rmarkdown) is evolving rapidly. Installing the current version of RStudio is highly recommended, particularly for the previews of the R Markdown code (this doc was created with RStudio 1.1.83). (Here is my install guide, which includes links to tutorials and cheat sheets. For somewhat more advanced survey data cleaning, click here.)

Even if you’ve knit Rmd before, your libraries may not be new enough to create parameterized reports. I recommend installing pacman, which has a convenience function p_load that smooths package installation, loading, and maintenance. I’d recommend p_load particularly if you are collaborating, say, on Dropbox.

install.packages("pacman")
library(pacman)  # p_load is not available until pacman is attached
p_load(rmarkdown, knitr, foreign, scales, questionr, tidyverse, update = TRUE)
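For readers who prefer not to add a dependency, the core idea behind p_load can be sketched in base R. This is only an approximation under stated assumptions: load_or_install is a hypothetical helper name, and unlike p_load with update = TRUE it does not update packages that are already installed.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated with packages that ship with R, so nothing is downloaded here.
load_or_install(c("stats", "utils"))
```

In practice the vector would list the packages used in this guide (rmarkdown, knitr, foreign, and so on).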

Remember that knitting to PDF requires LaTeX (install links). By contrast, knitting to docx or HTML does not require LaTeX. Creating pptx is possible in R with library(ReporteRs), but is not discussed here.

The Data

Download the four “political surveys” from Pew Research available here (i.e., January, March, August, and October 2016). You may recall, some politics happened in 2016. (The data is free, provided you take a moment to make an account.)

If need be, decompress each zip folder.

Three of my folders have intuitive names (Jan16, March16, and Oct16), but one of my folders picked up a lengthy name, http___www.people-press.org_files_datasets_Aug16. Don’t worry about that.

Create a new folder, call it, say, automating.
Move all four data folders into automating.

Please note that I have no affiliation (past or present) with Pew Research. I simply think that they do great work and they make it relatively hassle-free to get started with meaningful data sets.

The R Notebook (R Markdown) Template

(R Markdown ninjas can skip this section.)

In RStudio, create a new R Notebook and save it as pewpoliticaltemplate.Rmd in the automating folder you just created. This document will likely knit to HTML by default; hold down the Knit button to change it to PDF. Add fields to the header as desired. The sample header below automatically puts today’s date on the document by parsing the expression next to Date: as R code. classoption: landscape may help with wide tables. You can also specify the file that contains your bibliography in several formats, such as BibTeX and EndNote (citation details).

Next add an R code chunk to pewpoliticaltemplate.Rmd to take care of background stuff like formatting. Though setting a working directory would not be needed just to knit the Rmd, the directory must be set by knitr::opts_knit$set(root.dir = '...') to automate document prep. (setwd isn’t needed in the Rmd, but setting the working directory separately in Console is recommended if you’re still editing.)
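A setup chunk along the following lines would cover the defaults discussed in this section. Treat it as a sketch: the particular option values are illustrative assumptions rather than the template’s exact settings, and the path is a placeholder for wherever your automating folder lives. Note that the knitr chunk option that passes output through untouched is spelled results = "asis".

```r
library(knitr)

# Defaults applied to every later chunk (values are illustrative assumptions).
opts_chunk$set(
  echo = FALSE,      # hide the R code in the report
  warning = FALSE,   # suppress warnings
  message = FALSE,   # suppress package start-up messages
  comment = "",      # no "##" prefix on printed output
  results = "asis"   # let print(kable(...)) output be rendered as a table
)

# Placeholder path: point this at the folder that holds the data folders.
opts_knit$set(root.dir = "~/Desktop/automating")
```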

The Play button at the top right gives a preview of the code’s output, which is handy. If some part of the analysis is very lengthy, it only needs to be run once, freeing you to tinker with graphics and the like.

– Now the default settings have been set and you don’t need to worry about suppressing warnings and so on with each code chunk. You can, of course, change them case-by-case as you like.

– Unlike in R, when setting the format options for individual code chunks (as shown above to suppress warnings before the defaults kick in), you do need to type out the words TRUE and FALSE in full.

– Unlike the template, in this doc, I’ve set the defaults to echo = TRUE and tidy = TRUE to display the R code more pleasingly.

– The setting results = "asis" is very useful for professionally formatted tables (shown below), but is not recommended for raw R output of matrices and tables. To make raw data frames display with kable by default, see here.

The Template

I find it easiest to write a fully working example and then make little changes as needed so that rmarkdown::render() can loop over the data sets. First things first.

survey <- read.spss("Jan16/Jan16 public.sav", to.data.frame = TRUE)

Summary stats can easily be inserted into the text like so:

The template contains additional examples with survey weights (lengthier calculations should be done in blocks of code and then their result referenced with that inline style).
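As a rough illustration of that pattern, the chunk below computes a weighted share on a made-up data frame (survey_toy and its values are hypothetical, not Pew data) and stores the result so the prose can reference it inline.

```r
# Hypothetical toy data standing in for the survey.
survey_toy <- data.frame(
  q1 = c("Approve", "Disapprove", "Approve", "Disapprove"),
  weight = c(1, 2, 1.5, 0.5)
)

# Weighted share approving: sum of approvers' weights over the sum of all weights.
approve_share <- with(survey_toy, sum(weight[q1 == "Approve"]) / sum(weight))
approve_pct <- round(100 * approve_share, 1)

# In the Rmd text, the stored value can then be referenced inline, e.g.:
#   Approval stands at `r approve_pct`% of respondents.
approve_pct
```

Keeping the calculation in a chunk and only the object name in the sentence means the text updates itself when the data file changes.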

Here is a basic plot we might want, which reflects the survey weights. facet_grid() is used to create analogous plots for each party identification. The plot uses the slightly wonky syntax y = (..count..)/sum(..count..) to display the results as percentages rather than counts. Note that some code that cleans the data (mostly shortening labels) is omitted for brevity, but can be found here.

PA <- ggplot(survey) + theme_minimal()
PA <- PA + geom_bar(aes(q1, y = (..count..)/sum(..count..), weight = weight, 
    fill = q1))
PA <- PA + facet_grid(party.clean ~ .) + theme(strip.text.y = element_text(angle = 45))
PA <- PA + xlab("") + ylab("Percent of Country")
PA <- PA + ggtitle("Presidential Approval: January 2016")
PA <- PA + scale_y_continuous(labels = scales::percent)
PA

Here is an example of a weighted crosstab. knitr::kable will create a table that’s professional in appearance (when knit as PDF; kable takes the style of an academic journal).

kable(wtd.table(survey$ideo, survey$sex, survey$weight)/nrow(survey), digits = 2)

	Male	Female
Very conservative	0.04	0.03
Conservative	0.14	0.13
Moderate	0.20	0.20
Liberal	0.08	0.09
Very liberal	0.03	0.03
DK*	0.02	0.03
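For comparison, here is a base-R sketch of a weighted crosstab using xtabs on made-up data (the toy data frame is hypothetical). It differs from the call above in one respect, stated plainly: it normalizes by the sum of the weights via prop.table rather than dividing by nrow(survey), so the cells sum to exactly one.

```r
# Hypothetical toy data: two ideology groups, two sexes, unequal weights.
toy <- data.frame(
  ideo = c("Liberal", "Liberal", "Conservative", "Conservative"),
  sex = c("Male", "Female", "Male", "Female"),
  weight = c(1, 1, 2, 4)
)

# xtabs sums the weights within each ideo-by-sex cell.
weighted_counts <- xtabs(weight ~ ideo + sex, data = toy)

# prop.table divides every cell by the grand total of the weights.
weighted_props <- prop.table(weighted_counts)
round(weighted_props, 2)
```

The result can be passed to kable in the same way as the wtd.table output.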

Suppose we want to display Presidential Approval, where the first column provides overall approval and subsequent columns are crosstabs for various factors of interest (using the cell weights). I’ve written a convenience function called Xtabs that creates this format, which is common in the survey world.

source("https://raw.githubusercontent.com/rdrr1990/datascience101/master/automating/Xtabs.R")
kable(Xtabs(survey, "q1", c("sex", "race"), weight = "cellweight"))

	Overall	Male	Female	White (nH)	Black (nH)	Hispanic	Other	DK*
Approve	45.6%	21.3%	24.2%	20.4%	9.48%	10.2%	4.97%	0.443%
Disapprove	48.5%	25.5%	23%	39.6%	1.33%	3.95%	2.53%	1.12%
Don’t Know (VOL)	5.94%	2.67%	3.27%	3.39%	0.646%	1.14%	0.489%	0.269%

Suppose we want to do many crosstabs. The syntax survey$ideo is widely used for convenience, but survey[["ideo"]] will serve us better since it allows us to work with vectors of variable names ( details from win-vector ). Below, the first two comparisons return TRUE, but the final one returns FALSE because there is no variable “x” in the data frame survey ($ looks for a column literally named x rather than using the value stored in x).

identical(survey$ideo, survey[["ideo"]])

[1] TRUE

x <- "ideo"
identical(survey[[x]], survey[["ideo"]])

[1] TRUE

identical(survey[[x]], survey$x)

[1] FALSE
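The practical payoff is that [[ ]] accepts a name held in a variable, which is what makes looping possible. A small sketch on a made-up data frame (toy and its columns are hypothetical):

```r
# Hypothetical toy data frame with two categorical variables.
toy <- data.frame(
  ideo = c("Liberal", "Moderate", "Liberal"),
  party = c("Dem", "Rep", "Ind"),
  stringsAsFactors = FALSE
)

vars <- c("ideo", "party")

# [[v]] uses the value stored in v, so each column is visited in turn.
n_categories <- sapply(vars, function(v) length(unique(toy[[v]])))
n_categories

# $v looks for a column literally named "v", which does not exist here.
is.null(toy$v)
```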

So say we want weighted crosstabs for ideology and party ID crossed by all of the questions numbered in the 20s (q20, q21, q22a, and so on). Here is some code that will do that.

x <- names(survey)[grep("q2[[:digit:]]", names(survey))]
x

 [1] "q20"  "q21"  "q22a" "q22b" "q22c" "q22d" "q22e" "q22f" "q22g" "q22h"
[11] "q22i" "q25"  "q26"  "q27"  "q28"
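To see what that pattern does and does not match, here is a sketch on a made-up vector of names (the names, including oftvq21, are hypothetical). It also shows grep with value = TRUE, which returns the matching names directly, and an anchored variant that may be safer when other variable names happen to contain the same characters.

```r
# Hypothetical variable names, including one that merely contains "q21".
var_names <- c("q1", "q2", "q20", "q22a", "q25", "oftvq21", "sex")

# Unanchored: matches "q2" followed by a digit anywhere in the name.
unanchored <- grep("q2[[:digit:]]", var_names, value = TRUE)

# Anchored with ^: the name must start with "q2" followed by a digit.
anchored <- grep("^q2[[:digit:]]", var_names, value = TRUE)

unanchored
anchored
```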

y <- c("ideo", "party")

for (i in x) {
    for (j in y) {
        cat("\nWeighted proportions for", i, "broken down by", j, "\n")
        print(kable(wtd.table(survey[[i]], survey[[j]], survey$weight)/nrow(survey), 
            digits = 2))
        cat("\n")  # break out of table formatting
    }
    cat("\\newpage")
}

A few notes:

– This code will only work with the results = "asis" setting (shown above) that lets knitr interpret the output of print(kable()) as something to render (rather than just Markdown code to display for use elsewhere).

– Ideally one would have a csv or data.frame of the questions, and display them as loop-switched questions. In this case, the questionnaire is in a docx and so library(docxtractr) may help.

– Rather than a nested loop, one would likely prefer to pick a question, loop over the demographic and ideological categories for the crosstabs, and then insert commentary and overview.

– The outer loop starts a new page each time it runs by calling cat("\\newpage"), which is specific to rendering as PDF. Within the inner loop, the extra line breaks \n are needed to break out of the table formatting and keep code and text separate. A different approach to page breaks is needed for docx.

Adapting the Template with Parameters

The next step is to add a parameter with any variables you need. The parameters will be controlled by the R script discussed below. There is, of course, quite a bit of choice as to what is controlled by which file, but often only a handful of parameters are necessary. Add the following to the end of the header of pewpoliticaltemplate.Rmd:

params:
  spssfile: !r  1
  surveywave: !r 2016

That creates variables params$spssfile and params$surveywave that can be controlled externally from other R sessions, and gives them default values of 1 and 2016 respectively. Setting default values smooths debugging by allowing you to continue knitting the Rmd on its own (rather than from the R script we will create in a moment). You can also click on Knit and choose Knit with Parameters to specify particular values.

Now make any changes to the Rmd template. For example, in the ggplot code…

PA <- PA + ggtitle(paste("Presidential Approval:", params$surveywave))

Notice we can get a list of all the spss files like so:

dir(pattern = "sav", recursive = TRUE)

[1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav"
[2] "Jan16/Jan16 public.sav"                                           
[3] "March16/March16 public.sav"                                       
[4] "Oct16/Oct16 public.sav"

or in this case

dir(pattern = "16 public.sav", recursive = TRUE)

[1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav"
[2] "Jan16/Jan16 public.sav"                                           
[3] "March16/March16 public.sav"                                       
[4] "Oct16/Oct16 public.sav"

I recommend making the pattern as specific as possible in case you or your collaborators add other spss files with similar names. To use regular expressions to specify more complicated patterns, see here.

Now back to editing pewpoliticaltemplate.Rmd…
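One way the template could use the parameter is to treat it as an index into the sorted file listing. The sketch below assumes exactly that, and it builds a throwaway folder of empty files so it runs anywhere; the demo folder, the stray notes.txt, and the spssfile stand-in are all hypothetical.

```r
# Build a throwaway folder tree that mimics the data folders (empty files only).
root <- file.path(tempdir(), "automating_demo")
for (f in c("Jan16", "March16", "Oct16")) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, f, paste0(f, " public.sav")))
}
file.create(file.path(root, "Jan16", "notes.txt"))  # should not be picked up

# Escaping the dot and anchoring with $ keeps the pattern specific.
spss_files <- dir(root, pattern = "16 public\\.sav$", recursive = TRUE)

spssfile <- 2                      # stands in for params$spssfile
chosen <- spss_files[spssfile]     # files come back in alphabetical order
chosen

# In the template, this path would then be handed to
# read.spss(chosen, to.data.frame = TRUE) in place of the hard-coded file name.
```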

Knit the file to see how it looks with these default settings; that’s it for this portion.

Automating with knitr

Now create a new R script; mine’s called pew_report_generator.R. It’s just a simple loop that tells render which data set to grab, as well as the label to pass to the Rmd. Note that the labels appear in alphabetical rather than chronological order because that is the order in which the files are found.

library(pacman)
p_load(knitr, rmarkdown, sessioninfo)

setwd("/users/mohanty/Desktop/pewpolitical/")

waves <- c("August 2016", "January 2016", "March 2016", "October 2016")

for (i in seq_along(waves)) {
    render("pewpoliticaltemplate.Rmd", params = list(spssfile = i, surveywave = waves[i]), 
        output_file = paste0("Survey Analysis ", waves[i], ".pdf"))
}

session <- session_info()
save(session, file = paste0("session", format(Sys.time(), "%m%d%Y"), ".Rdata"))

That’s it. Of course, in practice you might write some code on the first survey that doesn’t work for all of them. Pew, for example, seems to have formatted the survey date differently in the last two surveys, which led me to make a few changes. But if the data are formatted fairly consistently, a one-time investment can save massive amounts of time lost to error-prone copying and pasting.
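Typing the labels by hand means they must be kept in the same order as the file listing. As a possible alternative, the labels could be derived from the file names themselves. The sketch below assumes the naming convention seen in these four files (a month name or abbreviation followed by a two-digit year) and would need adjusting for other conventions.

```r
# File paths as listed earlier in this guide.
files <- c(
  "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav",
  "Jan16/Jan16 public.sav",
  "March16/March16 public.sav",
  "Oct16/Oct16 public.sav"
)

# Strip the folder and the " public.sav" suffix, leaving e.g. "Aug16".
stem <- sub(" public\\.sav$", "", basename(files))

# Split each stem into its letters (month) and digits (two-digit year).
month_part <- sub("[[:digit:]]+$", "", stem)
year_part <- paste0("20", sub("^[[:alpha:]]+", "", stem))

# Match on the first three letters so "March" and "Mar" both work.
month_full <- month.name[match(substr(month_part, 1, 3), month.abb)]

waves <- paste(month_full, year_part)
waves
```

Because the labels are built from the same vector that supplies the files, label i always describes file i.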

A Little Version Control

The last bit of code is not necessary, but is a convenient way to store which versions of which libraries were actually used on which version of R. If something works now but not in the future, install_version (found in library(devtools)) can be used to install the older version of particular packages.

s <- session_info()
s$platform

 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 os       macOS Sierra 10.12.6        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2017-11-06

s$packages

 package     * version date       source                          
 assertthat    0.2.0   2017-04-11 CRAN (R 3.4.0)                  
 backports     1.1.1   2017-09-25 CRAN (R 3.4.2)                  
 bindr         0.1     2016-11-13 CRAN (R 3.4.0)                  
 bindrcpp      0.2     2017-06-17 CRAN (R 3.4.0)                  
 broom         0.4.2   2017-02-13 CRAN (R 3.4.0)                  
 cellranger    1.1.0   2016-07-27 CRAN (R 3.4.0)                  
 clisymbols    1.2.0   2017-05-21 cran (@1.2.0)                   
 colorspace    1.3-2   2016-12-14 CRAN (R 3.4.0)                  
 digest        0.6.12  2017-01-27 CRAN (R 3.4.0)                  
 dplyr       * 0.7.4   2017-09-28 cran (@0.7.4)                   
 evaluate      0.10.1  2017-06-24 CRAN (R 3.4.1)                  
 forcats       0.2.0   2017-01-23 CRAN (R 3.4.0)                  
 foreign     * 0.8-69  2017-06-22 CRAN (R 3.4.2)                  
 formatR       1.5     2017-04-25 CRAN (R 3.4.0)                  
 ggplot2     * 2.2.1   2016-12-30 CRAN (R 3.4.0)                  
 glue          1.2.0   2017-10-29 CRAN (R 3.4.2)                  
 gtable        0.2.0   2016-02-26 CRAN (R 3.4.0)                  
 haven         1.1.0   2017-07-09 CRAN (R 3.4.1)                  
 highr         0.6     2016-05-09 CRAN (R 3.4.0)                  
 hms           0.3     2016-11-22 CRAN (R 3.4.0)                  
 htmltools     0.3.6   2017-04-28 CRAN (R 3.4.0)                  
 httpuv        1.3.5   2017-07-04 CRAN (R 3.4.1)                  
 httr          1.3.1   2017-08-20 cran (@1.3.1)                   
 jsonlite      1.5     2017-06-01 CRAN (R 3.4.0)                  
 knitr       * 1.17    2017-08-10 CRAN (R 3.4.1)                  
 labeling      0.3     2014-08-23 CRAN (R 3.4.0)                  
 lattice       0.20-35 2017-03-25 CRAN (R 3.4.2)                  
 lazyeval      0.2.1   2017-10-29 CRAN (R 3.4.2)                  
 lubridate     1.7.0   2017-10-29 CRAN (R 3.4.2)                  
 magrittr      1.5     2014-11-22 CRAN (R 3.4.0)                  
 mime          0.5     2016-07-07 CRAN (R 3.4.0)                  
 miniUI        0.1.1   2016-01-15 CRAN (R 3.4.0)                  
 mnormt        1.5-5   2016-10-15 CRAN (R 3.4.0)                  
 modelr        0.1.1   2017-07-24 CRAN (R 3.4.1)                  
 munsell       0.4.3   2016-02-13 CRAN (R 3.4.0)                  
 nlme          3.1-131 2017-02-06 CRAN (R 3.4.2)                  
 pacman      * 0.4.6   2017-05-14 CRAN (R 3.4.0)                  
 pkgconfig     2.0.1   2017-03-21 CRAN (R 3.4.0)                  
 plyr          1.8.4   2016-06-08 CRAN (R 3.4.0)                  
 psych         1.7.8   2017-09-09 CRAN (R 3.4.1)                  
 purrr       * 0.2.4   2017-10-18 CRAN (R 3.4.2)                  
 questionr   * 0.6.3   2017-11-06 local                           
 R6            2.2.2   2017-06-17 CRAN (R 3.4.0)                  
 Rcpp          0.12.13 2017-09-28 cran (@0.12.13)                 
 readr       * 1.1.1   2017-05-16 CRAN (R 3.4.0)                  
 readxl        1.0.0   2017-04-18 CRAN (R 3.4.0)                  
 reshape2      1.4.2   2016-10-22 CRAN (R 3.4.0)                  
 rlang         0.1.2   2017-08-09 CRAN (R 3.4.1)                  
 rmarkdown     1.6     2017-06-15 CRAN (R 3.4.0)                  
 rprojroot     1.2     2017-01-16 CRAN (R 3.4.0)                  
 rstudioapi    0.7     2017-09-07 cran (@0.7)                     
 rvest         0.3.2   2016-06-17 CRAN (R 3.4.0)                  
 scales        0.5.0   2017-08-24 cran (@0.5.0)                   
 sessioninfo * 1.0.0   2017-06-21 CRAN (R 3.4.1)                  
 shiny         1.0.5   2017-08-23 cran (@1.0.5)                   
 stringi       1.1.5   2017-04-07 CRAN (R 3.4.0)                  
 stringr       1.2.0   2017-02-18 CRAN (R 3.4.0)                  
 tibble      * 1.3.4   2017-08-22 cran (@1.3.4)                   
 tidyr       * 0.7.2   2017-10-16 CRAN (R 3.4.2)                  
 tidyverse   * 1.1.1   2017-01-27 CRAN (R 3.4.0)                  
 withr         2.0.0   2017-10-25 Github (jimhester/withr@a43df66)
 xml2          1.1.1   2017-01-24 CRAN (R 3.4.0)                  
 xtable        1.8-2   2016-02-05 CRAN (R 3.4.0)                  
 yaml          2.1.14  2016-11-12 CRAN (R 3.4.0)
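A saved record like this can later be compared with what is installed on a given machine. The following base-R sketch shows the idea; check_versions is a hypothetical helper, and the recorded versions are made up for illustration (one is chosen to match and one to differ) rather than taken from the session above.

```r
# Hypothetical helper: compare recorded package versions with installed ones.
check_versions <- function(recorded) {
  installed <- vapply(
    names(recorded),
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  data.frame(
    package = names(recorded),
    recorded = unname(recorded),
    installed = unname(installed),
    same = unname(recorded) == unname(installed),
    stringsAsFactors = FALSE
  )
}

# Made-up record using packages that ship with R: stats should match the
# running R version, while the utils entry is deliberately wrong.
recorded <- c(stats = as.character(getRversion()), utils = "0.0.1")
report <- check_versions(recorded)
report
```

Packages flagged as different are the candidates for install_version if a report stops reproducing.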
