Civic Data Wrangling: in R and on data.world

by Jennifer Thompson (in collaboration with Rafael Pereira)

One of the most valuable things I have learned working on Data for Democracy’s Medicare drug spending project has been the value of collaborative tools. It has been my first in-depth experience using Github collaboratively, for one, but it has also introduced me to data.world.

data.world is an intuitive way to store, organize, explore, and visualize individual data files, making them more visible to and usable for anyone who’s interested. Using the data.world R package, I was able to collect the data I needed for one dashboard, create and publish a derived dataset, and build a Shiny dashboard that pulls live data from the site — all with complete visibility to my D4D teammates, who can now directly and easily access both the original and derived datasets.

The Beginning

Back in December, I started seeing posts about Data for Democracy floating around, and the idea immediately piqued my interest. I joined up not knowing what I would be doing exactly.

One of the ideas among pilot projects suggested by D4D leader Jonathon Morgan was looking at Medicare spending. Specifically, starting with the prescription spending data for 2011-2015 released by the Centers for Medicare and Medicaid Services. Out of all the original project ideas (check out the current and growing list of projects and be amazed), this one was most in my wheelhouse, and I jumped into the discussions to see what might happen.

D4D started developing some leadership structure, and Matt Gawarecki and I were asked to lead the drug spending project. We began by tidying the Medicare Part D spending data available from CMS.gov, which was straightforward enough (thanks, CMS!), and initially stored all that on our Github repo.

Soon, our A+ team of volunteers found data on therapeutic classes, lobbying expenditures, manufacturers, and more. As we looked for a better place to store and collaborate on our datasets, D4D projects started to leverage data.world. We decided that data.world was a good place for our data, too.

At that point, one of the first things we wanted to do was look at spending and prescribing patterns for specific medications, which would help us see what we were dealing with and generate some ideas for future analysis. James Porter drafted a Shiny app for visualizing users, claims, and costs for the most commonly prescribed medications over the five years of data available. I built on top of James’ work, and the full source code is available on Github.

The Data

Setup and Initial Analysis

Once we got the CMS spending data pulled out of its original Excel format, tidied, and uploaded, I used the data.world package to query the schema metadata and see exactly how tables were named, choose the right files, and pull them directly into R.

(To run this, you’ll need your data.world API token, which I saved as DW_API in an .Renviron file to make my code shareable.)

library(tidyverse)
library(data.world)
# API token is saved in .Renviron

data.world::set_config(cfg_env("DW_API"))

# Datasets are identified by their URL
drugs_ds <- "https://data.world/data4democracy/drug-spending"

# List tables
data_list <- data.world::query(
  qry_sql("SELECT * FROM Tables"),
  dataset = drugs_ds)

# data_list is a tbl_df with two columns: tableID and tableName.
data_list$tableName
##  [1] "FDA_NDC_Product"             "Data"                       
##  [3] "Methods"                     "Variables"                  
##  [5] "Pharma_Lobby"                "atc-codes"                  
##  [7] "companies_drugs_keyed"       "drug_list"                  
##  [9] "drug_uses"                   "drugdata_clean"             
## [11] "drugnames_withclasses"       "lobbying_keyed"             
## [13] "manufacturers_drugs_cleaned" "meps_full_2014"             
## [15] "spending-2011"               "spending-2012"              
## [17] "spending-2013"               "spending-2014"              
## [19] "spending-2015"               "spending_all_top100"        
## [21] "usp_drug_classification"

Yearly Data Consolidation

The CMS data is stored in the five spending-201x tables, one per year. I read those directly from data.world by writing a simple SQL query identifying those table names, then used dplyr and purrr to quickly add a year variable to each and combine them:

# Function to read in a single year's CSV from data.world and add year
get_year <- function(yr) {
  data.world::query(qry_sql(paste0("SELECT * FROM `spending-", yr, "`")),
    dataset = drugs_ds)[,-1] %>%
    ## First column is a row number; don"t need that
    mutate(year = yr)
}

# Read in and combine all years' data
spend <- map_df(2011:2015, get_year)

Saving The Work

Once I had the final dataset with just our top 100, I uploaded it directly to data.world as a CSV via the dwapi package, which is bundled, and loaded automatically, with data.world:

# Write final file to data.world using package ------------------------------------

# data.world's REST API methods are available via the dwapi package
dwapi::upload_data_frame(
  dataset = drugs_ds,
  file_name = "spending_all_top100.csv",
  data_frame = spend_all_top100
)

The App

Now that the initial data management is done, the app’s global.R reads the “top 100” CSV directly from data.world, and we’re all set.

# Read in datasets used for all plots ----------------------------------------
library(data.world)

# API token is saved in .Renviron (DW_API)
data.world::set_config(cfg_env("DW_API"))

# Read in dataset from data.world
drug_costs_everything <-
  data.world::query(
    qry_sql("SELECT * FROM `spending_all_top100`"),
    dataset = "data4democracy/drug-spending")

The app is available to the public and can be accessed at https://jenniferthompson.shinyapps.io/shinydashboard-medicared/.

What’s next?

The current app only uses a few of the datasets that are available on data.world, but we hope to be able to classify drugs by therapeutic class (e.g., all diabetes medications) soon. Once we have that information stored on data.world, we will be able to read data from both tables, join them, and give users the option to visualize some information by therapeutic class as well as by individual medication.

This is harder than one might think, and we welcome expertise from those who might try!

Finally, if you have tried data.world and have used it for your projects, make sure to share your stories, suggestions and complaints with the team.


About the author

Jennifer Thompson is a well-rounded conversationalist and a standup woman who spends her days as a biostatistician and avid R user. Jennifer is @jthompson on data.world and @jent103 on Twitter.

Share Comments · · · ·

You may leave a comment below or discuss the post in the forum community.rstudio.com.