Imagine your Data Before You Collect It

2019-07-01

by Graeme Blair, Jasper Cooper, Alexander Coppock and Macartan Humphreys

As data scientists, we are often presented with a dataset and are asked to use it to produce insights. We use R to wrangle, visualize, model, and produce tables and plots for sharing or publication. When we focus on the data in hand in this way, we don’t get to consider where the data came from. The sample size and the set of variables and their scales are fixed. Yet the procedures used to gather or generate them are hugely consequential for how we should analyze the data and also the quality of the insights we can ultimately deliver. Sampling procedures have implications for how the resulting data should be analyzed. For studies that seek to measure causal effects, it matters how some units come to be treated and others left untreated.

Because these processes are so important, we wanted to make a tool that would help data scientists and other researchers imagine their data before they collect it so that any changes to process can be made before it’s too late.

When the data is already collected, the tool allows you to imagine your data before you analyze it. When we make data wrangling and modeling decisions based on the results we find under each procedure, or using model fit statistics, we are vulnerable to the unconscious biases labeled the garden of forking paths or p-hacking that may lead us to select the analysis procedure that produces the best answer. We use the actual data because we don’t have a good substitute: data with the same structure and variables that we have collected.

This post introduces the fabricatr package, whose role in the DeclareDesign suite of packages is to simulate data structure and variables. See this RViews post introducing DeclareDesign and the philosophy behind it. fabricatr helps you to think about your data before you start analysis or even collection. What are the units? How are they structured? What measurements will you take? What are their ranges and how are they correlated? fabricatr can help you simulate mock data before you collect the real data, and test out different estimation strategies without worrying about biasing your inferences.

Imagining your data structure

Most simply, fabricatr will create a single-level data structure given a number of units.

library(fabricatr)
fabricate(N = 100, temp_fahrenheit = rnorm(N, mean = 80, sd = 20))

## Warning: `is_lang()` is deprecated as of rlang 0.2.0.
## Please use `is_call()` instead.
## This warning is displayed once per session.

## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
## Please use `call_name()` instead.
## This warning is displayed once per session.

ID	temp_fahrenheit
001	56.6
002	46.3
003	90.5
004	75.1
005	85.1
006	102.8

Social science data is often hierarchical. For example, schools have classrooms that have students. fabricatr shines here with the add_level command. By default, new levels are nested within the levels above them.

library(fabricatr)
fabricate(
  # five schools
  school  = add_level(N = 5,
  n_classrooms = sample(10:15, N, replace = TRUE)),
  # 10 to 15 classrooms per school
  classroom  = add_level(N = n_classrooms),
  # 15 students per classroom
  student = add_level(N = 15)
  )

## Warning: `lang_modify()` is deprecated as of rlang 0.2.0.
## Please use `call_modify()` instead.
## This warning is displayed once per session.

school	n_classrooms	classroom	student
1	12	01	001
1	12	01	002
1	12	01	003
1	12	01	004
1	12	01	005
1	12	01	006

The real world often produces messy, overlapping hierarchies. For example, student data may be collected from middle school and also high school, in which case students are nested in two different schools, but those schools are not nested within each other. Here’s how to make such “cross-classified” data. The rho parameter governs how correlated primary_rank and secondary_rank should be.

dat <- 
fabricate(
  primary_schools = add_level(N = 5, primary_rank = 1:N),
  secondary_schools = add_level(N = 6, secondary_rank = 1:N, nest = FALSE),
  students = link_levels(N = 15, by = join(primary_rank, secondary_rank, rho = 0.9))
)

## `link_levels()` calls are faster if the `mvnfast` package is installed.

ggplot(dat, aes(primary_rank, secondary_rank)) + geom_point(position = position_jitter(width = 0.1, height = 0.1), alpha = 0.5) + theme_bw()

Similarly, you can create longitudinal data via cross_levels:

fabricate(
 students = add_level(N = 2),
 years = add_level(N = 20, year = 1981:2000, nest = FALSE),
 student_year = cross_levels(by = join(students, years))
)

students	years	year	student_year
1	01	1981	01
2	01	1981	02
1	02	1982	03
2	02	1982	04
1	03	1983	05
2	03	1983	06

Imagining your variables

R has lots of great tools for simulating variables. In some cases, though, common kinds of outcome variables are surprisingly tough to simulate. fabricatr collects a small number of functions to create variable types commonly used by social scientists, with simple syntax. We describe two examples here, but see our variable creation vignette for the rest.

Variables with intra-class correlation

With the data structure tools described above, you can construct data that has within-unit and between-unit variation, for example, variation within classrooms and variation across classrooms in test scores. However, many times you want to set the level of intra-class correlation (ICC) more precisely. We help with draw_normal_icc and draw_binary_icc.

dat <- 
  fabricate(
    N = 1000,
    clusters = sample(LETTERS, N, replace = TRUE),
    Y1 = draw_normal_icc(clusters = clusters, ICC = .2),
    Y2 = draw_binary_icc(clusters = clusters, ICC = .2)
  )
ICC::ICCbare(clusters, Y1, dat)

## [1] 0.09726701

ICC::ICCbare(clusters, Y2, dat)

## [1] 0.176036

Ordered outcomes

We provide a set of tools for discrete random variables (including ordered outcomes). We take a latent variable (i.e., test_ability) and transform it into an ordered variable (test_score).

dat <- 
fabricate(
  N = 100,
  test_ability = rnorm(N),
  test_score = draw_ordered(test_ability, breaks = c(-.5, 0, .5))
)
ggplot(dat, aes(test_ability, test_score)) + geom_point() + theme_bw()

fabricatr is compatible with almost any R variable creation function. We highlight other terrific R packages that help simulate social science-relevant variables in a vignette here.

Where to go next

This post is a high-level teaser for fabricatr’s functionality, but for a deeper introduction, check out the fabricatr getting started vignette. You can also download and print this cheatsheet.

You can install fabricatr from CRAN:

install.packages("fabricatr")
library(fabricatr)

Graeme Blair is an Assistant Professor of Political Science at UCLA.

Jasper Cooper is a Postdoctoral Research Associate at the Kahneman-Treisman Center for Behavioral Science and Public Policy at Princeton University.

Alexander Coppock is an Assistant Professor of Political Science at Yale University.

Macartan Humphreys is a Professor of Political Science at Columbia University and a Director of the research group “Institutions and Political Inequality” at the WZB Berlin Social Science Center.