<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Big Data on R Views</title>
    <link>https://rviews.rstudio.com/tags/big-data/</link>
    <description>Recent content in Big Data on R Views</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 17 Jul 2019 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://rviews.rstudio.com/tags/big-data/" rel="self" type="application/rss+xml" />
    
    
    
    
    <item>
      <title>Three Strategies for Working with Big Data in R</title>
      <link>https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/</link>
      <pubDate>Wed, 17 Jul 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/</guid>
      <description>
        


&lt;p&gt;For many R users, it’s obvious &lt;em&gt;why&lt;/em&gt; you’d want to use R with big data, but not so obvious how. In fact, many people (wrongly) believe that R just doesn’t work very well for big data.&lt;/p&gt;
&lt;p&gt;In this article, I’ll share three strategies for thinking about how to use big data in R, as well as some examples of how to execute each of them.&lt;/p&gt;
&lt;p&gt;By default, R runs only on data that can fit into your computer’s memory. Hardware advances have made this less of a problem for many users, since most laptops now come with at least 4-8 GB of memory, and you can get instances on any major cloud provider with terabytes of RAM. But this is still a real problem for almost any data set that could really be called &lt;em&gt;big data&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The fact that R runs on in-memory data is the biggest issue that you face when trying to use Big Data in R. The data has to fit into the RAM on your machine, and not even at a 1:1 ratio: because you’re actually &lt;em&gt;doing&lt;/em&gt; something with the data, a good rule of thumb is that your machine needs 2-3x the RAM of the size of your data.&lt;/p&gt;
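&lt;p&gt;As a rough back-of-the-envelope check, you can compare an object’s in-memory size against that rule of thumb (a sketch; the 3x multiplier is just the heuristic above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A million-row, two-column data frame as a stand-in for your data
x &amp;lt;- data.frame(a = rnorm(1e6), b = rnorm(1e6))

# Size of x in megabytes
size_mb &amp;lt;- as.numeric(object.size(x)) / 1024^2

# RAM you would want available to work with x, per the 2-3x rule of thumb
3 * size_mb&lt;/code&gt;&lt;/pre&gt;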
&lt;p&gt;Another big issue for doing Big Data work in R is that data transfer speeds are extremely slow relative to the time it takes to actually process the data once it has been transferred. For example, a call over the internet from San Francisco to New York City takes over 4 times longer than reading from a standard hard drive and over 200 times longer than reading from a solid state drive.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; This is an especially big problem early in a modeling or analytical project, when data might have to be pulled repeatedly.&lt;/p&gt;
&lt;p&gt;Nevertheless, there are effective methods for working with big data in R. In this post, I’ll share three strategies. It’s important to note that these strategies aren’t mutually exclusive – they can be combined as you see fit!&lt;/p&gt;
&lt;div id=&#34;strategy-1-sample-and-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 1: Sample and Model&lt;/h2&gt;
&lt;p&gt;To sample and model, you downsample your data to a size that can be easily downloaded in its entirety and create a model on the sample. Downsampling to thousands – or even hundreds of thousands – of data points can make model runtimes feasible while also maintaining statistical validity.&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If maintaining class balance is necessary (or one class needs to be over/under-sampled), it’s reasonably simple to stratify the data set during sampling.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/sample_model.png&#34; alt=&#34;Illustration of Sample and Model&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Illustration of Sample and Model&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt; Relative to working on your entire data set, working on just a sample can drastically decrease run times and increase iteration speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prototyping&lt;/strong&gt; Even if you’ll eventually have to run your model on the entire data set, this can be a good way to refine hyperparameters and do feature engineering for your model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Packages&lt;/strong&gt; Since you’re working on a normal in-memory data set, you can use all your favorite R packages.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sampling&lt;/strong&gt; Downsampling isn’t terribly difficult, but does need to be done with care to ensure that the sample is valid and that you’ve pulled enough points from the original data set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt; If you’re using sample and model to prototype something that will later be run on the full data set, you’ll need to have a strategy (such as &lt;a href=&#34;#push-compute&#34;&gt;pushing compute to the data&lt;/a&gt;) for scaling your prototype version back to the full data set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Totals&lt;/strong&gt; Business Intelligence (BI) tasks frequently answer questions about totals, like the count of all sales in a month. One of the other strategies is usually a better fit in this case.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;strategy-2-chunk-and-pull&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 2: Chunk and Pull&lt;/h2&gt;
&lt;p&gt;In this strategy, the data is chunked into separable units and each chunk is pulled separately and operated on serially, in parallel, or after recombining. This strategy is conceptually similar to the &lt;a href=&#34;https://en.wikipedia.org/wiki/MapReduce&#34;&gt;MapReduce&lt;/a&gt; algorithm. Depending on the task at hand, the chunks might be time periods, geographic units, or logical units like separate businesses, departments, products, or customer segments.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/chunk_pull.png&#34; alt=&#34;Chunk and Pull Illustration&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Chunk and Pull Illustration&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full data set&lt;/strong&gt; The entire data set gets used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallelization&lt;/strong&gt; If the chunks are run separately, the problem is easy to treat as &lt;a href=&#34;https://en.wikipedia.org/wiki/Embarrassingly_parallel&#34;&gt;embarrassingly parallel&lt;/a&gt; and to use parallelization to speed up runtimes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Need Chunks&lt;/strong&gt; Your data needs to have separable chunks for chunk and pull to be appropriate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pull All Data&lt;/strong&gt; You eventually have to pull in all of the data, which may still be very time- and memory-intensive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stale Data&lt;/strong&gt; The data may require periodic refreshes from the database to stay up-to-date since you’re saving a version on your local machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;push-compute&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Strategy 3: Push Compute to Data&lt;/h2&gt;
&lt;p&gt;In this strategy, the data is summarized or filtered – compressed – inside the database, and only the compressed result set is moved out of the database into R. It is often possible to obtain significant speedups simply by doing summarization or filtering in the database before pulling the data into R.&lt;/p&gt;
&lt;p&gt;Sometimes, more complex operations are also possible, including computing histograms and raster maps with &lt;a href=&#34;https://db.rstudio.com/dbplot/&#34;&gt;&lt;code&gt;dbplot&lt;/code&gt;&lt;/a&gt;, building models with &lt;a href=&#34;https://cran.r-project.org/web/packages/modeldb/index.html&#34;&gt;&lt;code&gt;modeldb&lt;/code&gt;&lt;/a&gt;, and generating predictions from machine learning models with &lt;a href=&#34;https://db.rstudio.com/tidypredict/&#34;&gt;&lt;code&gt;tidypredict&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
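&lt;p&gt;For instance, a histogram can be computed inside the database so that only the bin counts travel back to R (a sketch, assuming the &lt;code&gt;dbplot&lt;/code&gt; package and a remote table like the &lt;code&gt;df&lt;/code&gt; used in the examples below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dbplot)

# Bin counts are computed by the database; R only draws the bars
df %&amp;gt;% dbplot_histogram(arr_delay, binwidth = 15)&lt;/code&gt;&lt;/pre&gt;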
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/push_data.png&#34; alt=&#34;Push Compute to Data Illustration&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Push Compute to Data Illustration&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;advantages-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use the Database&lt;/strong&gt; Takes advantage of what databases are often best at: quickly summarizing and filtering data based on a query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More Info, Less Transfer&lt;/strong&gt; By compressing before pulling the data back to R, the entire data set gets used, but transfer times are far shorter than moving the entire data set would be.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;disadvantages-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database Operations&lt;/strong&gt; Depending on what database you’re using, some operations might not be supported.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Speed&lt;/strong&gt; In some contexts, the limiting factor for data analysis is the speed of the database itself, and so pushing more work onto the database is the last thing analysts want to do.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;an-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;An Example&lt;/h2&gt;
&lt;p&gt;I’ve preloaded the &lt;code&gt;flights&lt;/code&gt; data set from the &lt;a href=&#34;https://cran.r-project.org/web/packages/nycflights13/index.html&#34;&gt;&lt;code&gt;nycflights13&lt;/code&gt;&lt;/a&gt; package into a PostgreSQL database, which I’ll use for these examples.&lt;/p&gt;
&lt;p&gt;Let’s start by connecting to the database. I’m using a config file here to connect to the database, one of RStudio’s &lt;a href=&#34;https://db.rstudio.com/best-practices/managing-credentials/&#34;&gt;recommended database connection methods&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DBI)
library(dplyr)
library(ggplot2)

# Connection details come from a config file, e.g. config.yml
config &amp;lt;- config::get()

db &amp;lt;- DBI::dbConnect(
  odbc::odbc(),
  Driver = config$driver,
  Server = config$server,
  Port = config$port,
  Database = config$database,
  UID = config$uid,
  PWD = config$pwd,
  BoolsAsChar = &amp;quot;&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href=&#34;https://dplyr.tidyverse.org/&#34;&gt;&lt;code&gt;dplyr&lt;/code&gt;&lt;/a&gt; package is a great tool for interacting with databases, since I can write normal R code that is translated into SQL on the backend. I could also use the &lt;a href=&#34;https://db.rstudio.com/dbi/&#34;&gt;&lt;code&gt;DBI&lt;/code&gt;&lt;/a&gt; package to send queries directly, or a &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/language-engines.html#sql&#34;&gt;SQL chunk&lt;/a&gt; in the R Markdown document.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- dplyr::tbl(db, &amp;quot;flights&amp;quot;)
tally(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 1
##        n
##    &amp;lt;int&amp;gt;
## 1 336776&lt;/code&gt;&lt;/pre&gt;
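&lt;p&gt;To see the SQL that &lt;code&gt;dplyr&lt;/code&gt; generates for a query like this, &lt;code&gt;show_query()&lt;/code&gt; can be added to the end of the pipeline:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df %&amp;gt;% tally() %&amp;gt;% show_query()&lt;/code&gt;&lt;/pre&gt;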
&lt;p&gt;With only a few hundred thousand rows, this example isn’t close to the kind of big data that really requires a Big Data strategy, but it’s rich enough to demonstrate on.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;sample-and-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sample and Model&lt;/h2&gt;
&lt;p&gt;Let’s say I want to model whether flights will be delayed or not. This is a great problem to sample and model.&lt;/p&gt;
&lt;p&gt;Let’s start with some minor cleaning of the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create is_delayed column in database
df &amp;lt;- df %&amp;gt;%
  mutate(
    # Create is_delayed column
    is_delayed = arr_delay &amp;gt; 0,
    # Get just hour (currently formatted so 6 pm = 1800)
    hour = sched_dep_time / 100
  ) %&amp;gt;%
  # Remove small carriers that make modeling difficult
  filter(!is.na(is_delayed) &amp;amp; !carrier %in% c(&amp;quot;OO&amp;quot;, &amp;quot;HA&amp;quot;))


df %&amp;gt;% count(is_delayed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed      n
##   &amp;lt;lgl&amp;gt;       &amp;lt;int&amp;gt;
## 1 FALSE      194078
## 2 TRUE       132897&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These classes are reasonably well balanced, but since I’m going to be using logistic regression, I’m going to load a perfectly balanced sample of 40,000 data points.&lt;/p&gt;
&lt;p&gt;For most databases, random sampling methods don’t work super smoothly with R, so I can’t use &lt;code&gt;dplyr::sample_n&lt;/code&gt; or &lt;code&gt;dplyr::sample_frac&lt;/code&gt;. I’ll have to be a little more manual.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1028)

# Create a modeling dataset 
df_mod &amp;lt;- df %&amp;gt;%
  # Within each class
  group_by(is_delayed) %&amp;gt;%
  # Assign random rank (using random and row_number from postgres)
  mutate(x = random() %&amp;gt;% row_number()) %&amp;gt;%
  ungroup()

# Take first 20K for each class for training set
df_train &amp;lt;- df_mod %&amp;gt;%
  filter(x &amp;lt;= 20000) %&amp;gt;%
  collect()

# Take next 5K for test set
df_test &amp;lt;- df_mod %&amp;gt;%
  filter(x &amp;gt; 20000 &amp;amp; x &amp;lt;= 25000) %&amp;gt;%
  collect()

# Double check I sampled right
count(df_train, is_delayed)
count(df_test, is_delayed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed     n
##   &amp;lt;lgl&amp;gt;      &amp;lt;int&amp;gt;
## 1 FALSE      20000
## 2 TRUE       20000&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 2
##   is_delayed     n
##   &amp;lt;lgl&amp;gt;      &amp;lt;int&amp;gt;
## 1 FALSE       5000
## 2 TRUE        5000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s build a model – let’s see if we can predict whether there will be a delay or not by the combination of the carrier, the month of the flight, and the time of day of the flight.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod &amp;lt;- glm(is_delayed ~ carrier + 
             as.character(month) + 
             poly(sched_dep_time, 3),
           family = &amp;quot;binomial&amp;quot;, 
           data = df_train)

# Out-of-Sample AUROC
df_test$pred &amp;lt;- predict(mod, newdata = df_test)
auc &amp;lt;- suppressMessages(pROC::auc(df_test$is_delayed, df_test$pred))
auc&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Area under the curve: 0.6425&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, this is not a great model, and any modelers reading this will have many ideas for how to improve it. But that wasn’t the point!&lt;/p&gt;
&lt;p&gt;I built a model on a small subset of a big data set. Including sampling time, this took my laptop less than 10 seconds to run, making it easy to iterate quickly as I want to improve the model. After I’m happy with this model, I could pull down a larger sample or even the entire data set if it’s feasible, or do something with the model from the sample.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;chunk-and-pull&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Chunk and Pull&lt;/h2&gt;
&lt;p&gt;In this case, I want to build another model of on-time arrival, but I want to do it per-carrier. This is exactly the kind of use case that’s ideal for chunk and pull. I’m going to separately pull the data in by carrier and run the model on each carrier’s data.&lt;/p&gt;
&lt;p&gt;I’m going to start by just getting the complete list of the carriers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all unique carriers
carriers &amp;lt;- df %&amp;gt;% 
  select(carrier) %&amp;gt;% 
  distinct() %&amp;gt;% 
  pull(carrier)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, I’ll write a function that&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;takes the name of a carrier as input&lt;/li&gt;
&lt;li&gt;pulls the data for that carrier into R&lt;/li&gt;
&lt;li&gt;splits the data into training and test&lt;/li&gt;
&lt;li&gt;trains the model&lt;/li&gt;
&lt;li&gt;outputs the out-of-sample AUROC (a common measure of model quality)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;carrier_model &amp;lt;- function(carrier_name) {
  # Pull a chunk of data
  df_mod &amp;lt;- df %&amp;gt;%
    dplyr::filter(carrier == carrier_name) %&amp;gt;%
    collect()
  
  # Split into training and test
  split &amp;lt;- df_mod %&amp;gt;%
    rsample::initial_split(prop = 0.9, strata = &amp;quot;is_delayed&amp;quot;) %&amp;gt;% 
    suppressMessages()
  
  # Get training data
  df_train &amp;lt;- split %&amp;gt;% rsample::training()
  
  # Train model
  mod &amp;lt;- glm(is_delayed ~ as.character(month) + poly(sched_dep_time, 3),
             family = &amp;quot;binomial&amp;quot;,
             data = df_train)
  
  # Get out-of-sample AUROC
  df_test &amp;lt;- split %&amp;gt;% rsample::testing()
  df_test$pred &amp;lt;- predict(mod, newdata = df_test)
  suppressMessages(auc &amp;lt;- pROC::auc(df_test$is_delayed ~ df_test$pred))
  
  auc
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, I’m going to actually run the carrier model function across each of the carriers. This code runs pretty quickly, and so I don’t think the overhead of parallelization would be worth it. But if I wanted to, I would replace the &lt;code&gt;lapply&lt;/code&gt; call below with a parallel backend.&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(98765)
mods &amp;lt;- lapply(carriers, carrier_model) %&amp;gt;%
  suppressMessages()

names(mods) &amp;lt;- carriers&lt;/code&gt;&lt;/pre&gt;
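&lt;p&gt;For illustration, here’s what that swap might look like with the base &lt;code&gt;parallel&lt;/code&gt; package (a sketch, not code from the original analysis; note that DBI connections can’t be shared across processes, so in practice &lt;code&gt;carrier_model&lt;/code&gt; would need to open its own connection on each worker):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(parallel)

cl &amp;lt;- makeCluster(4)
# Reproducible random numbers across workers (see the footnote on RNG)
clusterSetRNGStream(cl, 98765)
# Load packages on each worker; carrier_model would also need to create
# its own database connection, since connections cannot be exported
clusterEvalQ(cl, {
  library(dplyr)
  library(DBI)
})
clusterExport(cl, &amp;quot;carrier_model&amp;quot;)

mods &amp;lt;- parLapply(cl, carriers, carrier_model)
names(mods) &amp;lt;- carriers

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;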
&lt;p&gt;Let’s look at the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mods&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $UA
## Area under the curve: 0.6408
## 
## $AA
## Area under the curve: 0.6041
## 
## $B6
## Area under the curve: 0.6475
## 
## $DL
## Area under the curve: 0.6162
## 
## $EV
## Area under the curve: 0.6419
## 
## $MQ
## Area under the curve: 0.5973
## 
## $US
## Area under the curve: 0.6096
## 
## $WN
## Area under the curve: 0.6968
## 
## $VX
## Area under the curve: 0.6969
## 
## $FL
## Area under the curve: 0.6347
## 
## $AS
## Area under the curve: 0.6906
## 
## $`9E`
## Area under the curve: 0.6071
## 
## $F9
## Area under the curve: 0.625
## 
## $YV
## Area under the curve: 0.7029&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So these models (again) are only a little better than random chance. But that was the point: we used the chunk and pull strategy to pull the data separately by logical units and to build a model on each chunk.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;push-compute-to-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Push Compute to the Data&lt;/h2&gt;
&lt;p&gt;In this case, I’m doing a pretty simple BI task - plotting the proportion of flights that are late by the hour of departure and the airline.&lt;/p&gt;
&lt;p&gt;Just by way of comparison, let’s run this first the naive way – pulling all the data to my system and then doing my data manipulation to plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system.time(
  df_plot &amp;lt;- df %&amp;gt;%
    collect() %&amp;gt;%
    # Change is_delayed to numeric
    mutate(is_delayed = ifelse(is_delayed, 1, 0)) %&amp;gt;%
    group_by(carrier, sched_dep_time) %&amp;gt;%
    # Get proportion per carrier-time
    summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %&amp;gt;%
    ungroup() %&amp;gt;%
    # Change string times into actual times
    mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, &amp;quot;left&amp;quot;, &amp;quot;0&amp;quot;) %&amp;gt;% 
             strptime(&amp;quot;%H%M&amp;quot;) %&amp;gt;% 
             as.POSIXct())) -&amp;gt; timing1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that wasn’t too bad, just &lt;code&gt;2.366&lt;/code&gt; seconds on my laptop.&lt;/p&gt;
&lt;p&gt;But let’s see how much of a speedup we can get by pushing the compute to the data. The conceptual change here is significant – I’m doing as much work as possible on the Postgres server now instead of locally. But using &lt;code&gt;dplyr&lt;/code&gt; means that the code change is minimal. The only difference in the code is that the &lt;code&gt;collect&lt;/code&gt; call got moved down by a few lines (to below &lt;code&gt;ungroup()&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;system.time(
  df_plot &amp;lt;- df %&amp;gt;%
    # Change is_delayed to numeric
    mutate(is_delayed = ifelse(is_delayed, 1, 0)) %&amp;gt;%
    group_by(carrier, sched_dep_time) %&amp;gt;%
    # Get proportion per carrier-time
    summarize(delay_pct = mean(is_delayed, na.rm = TRUE)) %&amp;gt;%
    ungroup() %&amp;gt;%
    collect() %&amp;gt;%
    # Change string times into actual times
    mutate(sched_dep_time = stringr::str_pad(sched_dep_time, 4, &amp;quot;left&amp;quot;, &amp;quot;0&amp;quot;) %&amp;gt;% 
             strptime(&amp;quot;%H%M&amp;quot;) %&amp;gt;% 
             as.POSIXct())) -&amp;gt; timing2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might have taken you the same time to read this code as the last chunk, but this took only &lt;code&gt;0.269&lt;/code&gt; seconds to run, almost an order of magnitude faster!&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; That’s pretty good for just moving one line of code.&lt;/p&gt;
&lt;p&gt;Now that we’ve done a speed comparison, we can create the nice plot we all came for.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_plot %&amp;gt;%
  mutate(carrier = paste0(&amp;quot;Carrier: &amp;quot;, carrier)) %&amp;gt;%
  ggplot(aes(x = sched_dep_time, y = delay_pct)) +
  geom_line() +
  facet_wrap(&amp;quot;carrier&amp;quot;) +
  ylab(&amp;quot;Proportion of Flights Delayed&amp;quot;) +
  xlab(&amp;quot;Time of Day&amp;quot;) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_datetime(date_breaks = &amp;quot;4 hours&amp;quot;, 
                   date_labels = &amp;quot;%H&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2019-07-01-3-big-data-paradigms-for-r_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It looks to me like flights later in the day might be a little more likely to experience delays, but that’s a question for another blog post.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://blog.codinghorror.com/the-infinite-space-between-words/&#34; class=&#34;uri&#34;&gt;https://blog.codinghorror.com/the-infinite-space-between-words/&lt;/a&gt;&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;This isn’t just a general heuristic. You’ll probably remember that the standard error in many statistical procedures shrinks by a factor of &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{\sqrt{n}}\)&lt;/span&gt; for sample size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, so a lot of the statistical power in your model is driven by adding the first few thousand observations compared to the final millions.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;One of the biggest problems when parallelizing is dealing with random number generation, which you use here to make sure that your test/training splits are reproducible. It’s not an insurmountable problem, but requires some careful thought.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;And lest you think the real difference here is offloading computation to a more powerful database, this Postgres instance is running on a container on my laptop, so it’s got exactly the same horsepower behind it.&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/&#39;;&lt;/script&gt;
      </description>
    </item>
    
  </channel>
</rss>
