Civic Data Wrangling: in R and on

by Jennifer Thompson (in collaboration with Rafael Pereira)

One of the most valuable things I have learned working on Data for Democracy’s Medicare drug spending project has been the value of collaborative tools. It has been my first in-depth experience using Github collaboratively, for one, but it has also introduced me to is an intuitive way to store, organize, explore, and visualize individual data files, making them more visible to and usable for anyone who’s interested. Using the R package, I was able to collect the data I needed for one dashboard, create and publish a derived dataset, and build a Shiny dashboard that pulls live data from the site — all with complete visibility to my D4D teammates, who can now directly and easily access both the original and derived datasets.

The Beginning

Back in December, I started seeing posts about Data for Democracy floating around, and the idea immediately piqued my interest. I joined up not knowing what I would be doing exactly.

One of the ideas among pilot projects suggested by D4D leader Jonathon Morgan was looking at Medicare spending. Specifically, starting with the prescription spending data for 2011-2015 released by the Centers for Medicare and Medicaid Services. Out of all the original project ideas (check out the current and growing list of projects and be amazed), this one was most in my wheelhouse, and I jumped into the discussions to see what might happen.

D4D started developing some leadership structure, and Matt Gawarecki and I were asked to lead the drug spending project. We began by tidying the Medicare Part D spending data available from, which was straightforward enough (thanks, CMS!), and initially stored all that on our Github repo.

Soon, our A+ team of volunteers found data on therapeutic classes, lobbying expenditures, manufacturers, and more. As we looked for a better place to store and collaborate on our datasets, D4D projects started to leverage We decided that was a good place for our data, too.

At that point, one of the first things we wanted to do was look at spending and prescribing patterns for specific medications, which would help us see what we were dealing with and generate some ideas for future analysis. James Porter drafted a Shiny app for visualizing users, claims, and costs for the most commonly prescribed medications over the five years of data available. I built on top of James’ work, and the full source code is available on Github.

The Data

Setup and Initial Analysis

Once we got the CMS spending data pulled out of its original Excel format, tidied, and uploaded, I used the package to query the schema metadata and see exactly how tables were named, choose the right files, and pull them directly into R.

(To run this, you’ll need your API token, which I saved as DW_API in an .Renviron file to make my code shareable.)

# API token is saved in .Renviron"DW_API"))

# Datasets are identified by their URL
drugs_ds <- ""

# List tables
data_list <-
  qry_sql("SELECT * FROM Tables"),
  dataset = drugs_ds)

# data_list is a tbl_df with two columns: tableID and tableName.
##  [1] "FDA_NDC_Product"             "Data"                       
##  [3] "Methods"                     "Variables"                  
##  [5] "Pharma_Lobby"                "atc-codes"                  
##  [7] "companies_drugs_keyed"       "drug_list"                  
##  [9] "drug_uses"                   "drugdata_clean"             
## [11] "drugnames_withclasses"       "lobbying_keyed"             
## [13] "manufacturers_drugs_cleaned" "meps_full_2014"             
## [15] "spending-2011"               "spending-2012"              
## [17] "spending-2013"               "spending-2014"              
## [19] "spending-2015"               "spending_all_top100"        
## [21] "usp_drug_classification"

Yearly Data Consolidation

The CMS data is stored in the five spending-201x tables, one per year. I read those directly from by writing a simple SQL query identifying those table names, then used dplyr and purrr to quickly add a year variable to each and combine them:

# Function to read in a single year's CSV from and add year
get_year <- function(yr) {"SELECT * FROM `spending-", yr, "`")),
    dataset = drugs_ds)[,-1] %>%
    ## First column is a row number; don"t need that
    mutate(year = yr)

# Read in and combine all years' data
spend <- map_df(2011:2015, get_year)

Saving The Work

Once I had the final dataset with just our top 100, I uploaded it directly to as a CSV via the dwapi package, which is bundled, and loaded automatically, with

# Write final file to using package ------------------------------------

#'s REST API methods are available via the dwapi package
  dataset = drugs_ds,
  file_name = "spending_all_top100.csv",
  data_frame = spend_all_top100

The App

Now that the initial data management is done, the app’s global.R reads the “top 100” CSV directly from, and we’re all set.

# Read in datasets used for all plots ----------------------------------------

# API token is saved in .Renviron (DW_API)"DW_API"))

# Read in dataset from
drug_costs_everything <-
    qry_sql("SELECT * FROM `spending_all_top100`"),
    dataset = "data4democracy/drug-spending")

The app is available to the public and can be accessed at

What’s next?

The current app only uses a few of the datasets that are available on, but we hope to be able to classify drugs by therapeutic class (e.g., all diabetes medications) soon. Once we have that information stored on, we will be able to read data from both tables, join them, and give users the option to visualize some information by therapeutic class as well as by individual medication.

This is harder than one might think, and we welcome expertise from those who might try!

Finally, if you have tried and have used it for your projects, make sure to share your stories, suggestions and complaints with the team.

About the author

Jennifer Thompson is a well-rounded conversationalist and a standup woman who spends her days as a biostatistician and avid R user. Jennifer is @jthompson on and @jent103 on Twitter.

Share Comments · · · ·