How to Scrape and Store Strava Data Using R

2021-11-22

by Julian During

This post by Julian During is the third place winner in the Call for Documentation contest. Julian is a data scientist from Germany working in the manufacturing industry. Julian loves working with R (especially the tidyverse ecosystem), sports, black coffee and cycling.

I am an avid runner and cyclist. For the past couple of years, I have recorded almost all my activities on some kind of GPS device.

I record my runs with a Garmin device and my bike rides with a Wahoo device, and I synchronize both accounts on Strava. I figured that it would be nice to directly access my data from my Strava account.

In the following text, I will describe the process of getting Strava data into R, processing the data, and then creating a visualization of activity routes. You can find the original analysis in this GitHub repository.

You will need the following packages:

library(tarchetypes)
library(conflicted)
library(tidyverse)
library(lubridate)
library(jsonlite)
library(targets)
library(httpuv)
library(httr)
library(pins)
library(fs)
library(readr)

conflict_prefer("filter", "dplyr")

Set Up Targets

The whole data pipeline is implemented with the help of the targets package. You can learn more about the package and its functionalities here.

In order to reproduce the analysis, perform the following steps:

1. Clone the repository: https://github.com/duju211/pin_strava
2. Install the packages listed in the ‘libraries.R’ file
3. Run the target pipeline by executing the targets::tar_make() command
4. Follow the instructions printed in the console

Target Plan

The manifest of the target plan looks like this:

name	command	pattern	cue_mode
my_app	define_strava_app()	NA	thorough
my_endpoint	define_strava_endpoint()	NA	thorough
act_col_types	list(moving = col_logical(), velocity_smooth = col_number(), grade_smooth = col_number(), distance = col_number(), altitude = col_number(), time = col_integer(), lat = col_number(), lng = col_number(), cadence = col_integer(), watts = col_integer())	NA	thorough
my_sig	define_strava_sig(my_endpoint, my_app)	NA	always
df_act_raw	read_all_activities(my_sig)	NA	thorough
df_act	pre_process_act(df_act_raw, athlete_id)	NA	thorough
act_ids	pull(distinct(df_act, id))	NA	thorough
df_meas	read_activity_stream(act_ids, my_sig)	map(act_ids)	never
df_meas_all	bind_rows(df_meas)	NA	thorough
df_meas_wide	meas_wide(df_meas_all)	NA	thorough
df_meas_pro	meas_pro(df_meas_wide)	NA	thorough
gg_meas	vis_meas(df_meas_pro)	NA	thorough
df_meas_norm	meas_norm(df_meas_pro)	NA	thorough
gg_meas_save	save_gg_meas(gg_meas)	NA	thorough
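
As a rough sketch (not the repository's actual pipeline file), a few rows of this manifest could be declared with tar_target() as shown below. The command and pattern columns correspond to the arguments of the same name, and cue_mode corresponds to tar_cue(mode = ...). The helper functions are the ones defined later in this post; they are only referenced here, not run.

```r
library(targets)

# Sketch of how some manifest rows could be declared. The commands are stored
# unevaluated, so the helper functions do not have to exist at this point.
strava_targets <- list(
  tar_target(my_app, define_strava_app()),
  tar_target(my_endpoint, define_strava_endpoint()),
  # cue mode "always": rerun on every tar_make() to get a fresh token
  tar_target(
    my_sig,
    define_strava_sig(my_endpoint, my_app),
    cue = tar_cue(mode = "always")
  ),
  tar_target(act_ids, pull(distinct(df_act, id))),
  # dynamic branching: one branch per activity ID; cue mode "never" skips
  # branches that have already been built
  tar_target(
    df_meas,
    read_activity_stream(act_ids, my_sig),
    pattern = map(act_ids),
    cue = tar_cue(mode = "never")
  )
)
```

In a real project, a list like this would be the last expression of the _targets.R file.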

We will go through the most important targets in detail.

OAuth Dance from R

The Strava API requires an ‘OAuth dance’, described below.

1. Create an OAuth Strava app
To get access to your Strava data from R, you must first create a Strava API application. The steps are documented on the Strava Developer site. While creating the app, you’ll have to give it a name. In my case, I named it r_api.

After you have created your app, you can find its Client ID and Client Secret in the Strava API settings. Save the Client ID as STRAVA_KEY and the Client Secret as STRAVA_SECRET in your R environment.¹

STRAVA_KEY=<Client ID>
STRAVA_SECRET=<Client Secret>

Then, you can run the function define_strava_app shown below.

name	command	pattern	cue_mode
my_app	define_strava_app()	NA	thorough

define_strava_app <- function() {
  oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET")
  )
}
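
Sys.getenv() returns an empty string for a variable that is not set, so a missing key would only surface later as a failed login. A small check like the one below can catch that earlier. This is only a sketch: check_strava_env is a hypothetical helper and not part of the original pipeline.

```r
# Hypothetical helper: stop with a clear message if credentials are missing.
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- unname(Sys.getenv(vars, unset = ""))
  missing_vars <- vars[values == ""]
  if (length(missing_vars) > 0) {
    stop(
      "Missing environment variables: ",
      paste(missing_vars, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

You could call check_strava_env() at the top of define_strava_app() if you want the pipeline to fail early.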

2. Define an endpoint

Define an endpoint called my_endpoint using the function define_strava_endpoint.

The authorize argument is the URL that your browser is sent to for authorization, and the access argument is the URL that is used to exchange the result of that authorization for an access token.

name	command	pattern	cue_mode
my_endpoint	define_strava_endpoint()	NA	thorough

define_strava_endpoint <- function() {
  oauth_endpoint(request = NULL,
                 authorize = "https://www.strava.com/oauth/authorize",
                 access = "https://www.strava.com/oauth/token")
}
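
To make the role of the authorize URL more concrete, the sketch below builds the kind of URL your browser is sent to during the dance. You do not need this in the pipeline, because httr assembles the URL for you. The client ID 12345 is a made-up placeholder, the redirect URI is an assumption (a local callback address), and the parameter names follow the usual OAuth 2.0 authorization code flow.

```r
library(httr)

# Illustration only: httr builds a comparable URL during authentication.
# "12345" is a placeholder client ID; the redirect URI is an assumption.
auth_url <- modify_url(
  "https://www.strava.com/oauth/authorize",
  query = list(
    client_id = "12345",
    response_type = "code",
    redirect_uri = "http://localhost:1410/",
    scope = "activity:read_all"
  )
)

auth_url
```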

3. The final authentication step
Before you can execute the following steps, you have to authorize the app in your web browser.

name	command	pattern	cue_mode
my_sig	define_strava_sig(my_endpoint, my_app)	NA	always

define_strava_sig <- function(endpoint, app) {
  oauth2.0_token(
    endpoint,
    app,
    scope = "activity:read_all,activity:read,profile:read_all",
    type = NULL,
    use_oob = FALSE,
    as_header = FALSE,
    use_basic_auth = FALSE,
    cache = FALSE
  )
}

The information in my_sig can now be used to access Strava data. Set the cue_mode of the target to ‘always’ so that the following API calls are always executed with an up-to-date authorization token.

Access Activities

You are now authenticated and can directly access your Strava data.

1. Load all activities
Load a table that gives an overview of all your activities. The API returns the activities in pages, and the total number of pages is not known in advance, so request one page after another in a while loop. The loop ends as soon as a page comes back without any activities.

name	command	pattern	cue_mode
df_act_raw	read_all_activities(my_sig)	NA	thorough

read_all_activities <- function(sig) {
  activities_url <- parse_url("https://www.strava.com/api/v3/athlete/activities")

  act_vec <- vector(mode = "list")
  df_act <- tibble::tibble(init = "init")
  i <- 1L

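  # Request page after page until the API returns an empty page
  while (nrow(df_act) != 0) {
    r <- activities_url %>%
      modify_url(query = list(
        access_token = sig$credentials$access_token[[1]],
        page = i
      )) %>%
      GET()

    # Fail early on HTTP errors (e.g. an invalid token) instead of looping
    # over an error response
    stop_for_status(r)

    df_act <- content(r, as = "text") %>%
      fromJSON(flatten = TRUE) %>%
      as_tibble()
    if (nrow(df_act) != 0)
      act_vec[[i]] <- df_act
    i <- i + 1L
  }

  # Combine all pages and parse the start date as a date-time
  act_vec %>%
    bind_rows() %>%
    mutate(start_date = ymd_hms(start_date))
}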

The resulting data frame consists of one row per activity:

## # A tibble: 605 x 60
##    resource_state name  distance moving_time elapsed_time total_elevation~ type 
##             <int> <chr>    <dbl>       <int>        <int>            <dbl> <chr>
##  1              2 "Hes~   31153.        4699         5267            450   Ride 
##  2              2 "Bam~    5888.        2421         2869            102.  Run  
##  3              2 "Lin~   33208.        4909         6071            430   Ride 
##  4              2 "Mon~   74154.       10721        12500            641   Ride 
##  5              2 "Cha~   34380         5001         5388            464.  Ride 
##  6              2 "Mor~    5518.        2345         2563             49.1 Run  
##  7              2 "Bin~   10022.        3681         6447            131   Run  
##  8              2 "Tru~   47179.        8416        10102            898   Ride 
##  9              2 "Sho~   32580.        5646         6027            329.  Ride 
## 10              2 "Mit~   33862.        5293         6958            372   Ride 
## # ... with 595 more rows, and 53 more variables: workout_type <int>, id <dbl>,
## #   external_id <chr>, upload_id <dbl>, start_date <dttm>,
## #   start_date_local <chr>, timezone <chr>, utc_offset <dbl>,
## #   start_latlng <list>, end_latlng <list>, location_city <lgl>,
## #   location_state <lgl>, location_country <chr>, start_latitude <dbl>,
## #   start_longitude <dbl>, achievement_count <int>, kudos_count <int>,
## #   comment_count <int>, athlete_count <int>, photo_count <int>, ...
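
The paging pattern itself can be tried out without touching the API. The sketch below is a generic version of the loop: fetch_page is a hypothetical function that takes a page number and returns a data frame, and the fake pages are made-up data standing in for Strava's responses.

```r
library(dplyr)
library(tibble)

# Generic version of the loop: `fetch_page` is a hypothetical function that
# takes a page number and returns a data frame (zero rows when exhausted).
read_all_pages <- function(fetch_page) {
  pages <- list()
  i <- 1L
  repeat {
    page <- fetch_page(i)
    if (nrow(page) == 0) break
    pages[[i]] <- page
    i <- i + 1L
  }
  bind_rows(pages)
}

# Stand-in for the API: two pages of made-up activities, then an empty page.
fake_pages <- list(
  tibble(id = c("1", "2"), type = c("Ride", "Run")),
  tibble(id = "3", type = "Ride")
)
fake_fetch <- function(i) {
  if (i <= length(fake_pages)) fake_pages[[i]] else tibble()
}

all_activities <- read_all_pages(fake_fetch)
nrow(all_activities)
```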

2. Preprocess activities
Make sure that all ID columns are stored as character and set the athlete.id column from the athlete_id argument.

name	command	pattern	cue_mode
df_act	pre_process_act(df_act_raw, athlete_id)	NA	thorough

pre_process_act <- function(df_act_raw, athlete_id) {
  df_act <- df_act_raw %>%
    mutate(across(contains("id"), as.character),
           `athlete.id` = athlete_id)
}
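
A toy example shows what the conversion does. Activity IDs are larger than R's integer range, so they arrive as doubles; turning them into character makes them safe to use as identifiers. The values below are made up for illustration. Note that contains("id") selects every column whose name includes "id".

```r
library(dplyr)
library(tibble)

# Made-up activity overview; real activity IDs exceed R's integer range
# (.Machine$integer.max is 2147483647), so they arrive as doubles.
df_toy <- tibble(
  id = c(6218628649, 6213800583),
  upload_id = c(6610225971, 6604949042),
  distance = c(31153, 5888)
)

# Convert every column with "id" in its name to character
df_toy_chr <- df_toy %>%
  mutate(across(contains("id"), as.character))

df_toy_chr$id
```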

3. Extract activity IDs
Use dplyr::pull() to extract all activity IDs.

name	command	pattern	cue_mode
act_ids	pull(distinct(df_act, id))	NA	thorough

Read Measurements

1. Read the ‘stream’ data from Strava
A ‘stream’ is a nested list (JSON format) with all available measurements of the corresponding activity.

To get the available variables and turn the result into a data frame, define a helper function read_activity_stream. This function takes the ID of an activity and the authentication token that you created earlier.

The target is defined with dynamic branching which maps over all activity IDs. Define the cue mode as never to make sure that every target runs exactly once.

name	command	pattern	cue_mode
df_meas	read_activity_stream(act_ids, my_sig)	map(act_ids)	never

read_activity_stream <- function(id, sig) {
  act_url <-
    parse_url(stringr::str_glue("https://www.strava.com/api/v3/activities/{id}/streams"))
  access_token <- sig$credentials$access_token[[1]]

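  # Build the key list as one comma-separated string; a string literal that
  # is broken across lines would embed a newline and spaces in the query
  stream_keys <- paste(
    c("distance", "time", "latlng", "altitude", "velocity_smooth",
      "cadence", "watts", "temp", "moving", "grade_smooth"),
    collapse = ","
  )

  r <- modify_url(act_url,
                  query = list(
                    access_token = access_token,
                    keys = stream_keys
                  )) %>%
    GET()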

  stop_for_status(r)

  fromJSON(content(r, as = "text"), flatten = TRUE) %>%
    as_tibble() %>%
    mutate(id = id)
}
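
To see what fromJSON() makes of such a response, you can parse a small hand-written stand-in. The JSON below is made up and much shorter than a real stream, but it has the same shape: one object per measurement series, with the values in a data array.

```r
library(jsonlite)
library(tibble)

# Hand-written stand-in for a streams response (values are made up).
stream_json <- '[
  {"type": "distance", "data": [0, 5, 10.9],
   "series_type": "distance", "original_size": 3, "resolution": "high"},
  {"type": "latlng", "data": [[48.1, 9.1], [48.2, 9.2], [48.3, 9.3]],
   "series_type": "distance", "original_size": 3, "resolution": "high"}
]'

# One row per series; `data` becomes a list column
df_stream <- as_tibble(fromJSON(stream_json, flatten = TRUE))

# The latlng series is parsed into a matrix with one row per point
dim(df_stream$data[[2]])
```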

2. Bind the single targets into one data frame
You can do this using dplyr::bind_rows().

name	command	pattern	cue_mode
df_meas_all	bind_rows(df_meas)	NA	thorough

The data now has one row per measurement series:

## # A tibble: 4,821 x 6
##    type            data              series_type original_size resolution id    
##    <chr>           <list>            <chr>               <int> <chr>      <chr> 
##  1 moving          <lgl [4,706]>     distance             4706 high       62186~
##  2 latlng          <dbl [4,706 x 2]> distance             4706 high       62186~
##  3 velocity_smooth <dbl [4,706]>     distance             4706 high       62186~
##  4 grade_smooth    <dbl [4,706]>     distance             4706 high       62186~
##  5 distance        <dbl [4,706]>     distance             4706 high       62186~
##  6 altitude        <dbl [4,706]>     distance             4706 high       62186~
##  7 heartrate       <int [4,706]>     distance             4706 high       62186~
##  8 time            <int [4,706]>     distance             4706 high       62186~
##  9 moving          <lgl [301]>       distance              301 high       62138~
## 10 latlng          <dbl [301 x 2]>   distance              301 high       62138~
## # ... with 4,811 more rows

3. Turn the data into a wide format

name	command	pattern	cue_mode
df_meas_wide	meas_wide(df_meas_all)	NA	thorough

meas_wide <- function(df_meas) {
  pivot_wider(df_meas, names_from = type, values_from = data)
}

In this format, every activity is one row again:

## # A tibble: 605 x 14
##    series_type original_size resolution id         moving latlng velocity_smooth
##    <chr>               <int> <chr>      <chr>      <list> <list> <list>         
##  1 distance             4706 high       6218628649 <lgl ~ <dbl ~ <dbl [4,706]>  
##  2 distance              301 high       6213800583 <lgl ~ <dbl ~ <dbl [301]>    
##  3 distance             4905 high       6179655557 <lgl ~ <dbl ~ <dbl [4,905]>  
##  4 distance            10640 high       6160486739 <lgl ~ <dbl ~ <dbl [10,640]> 
##  5 distance             4969 high       6153936896 <lgl ~ <dbl ~ <dbl [4,969]>  
##  6 distance             2073 high       6115020306 <lgl ~ <dbl ~ <dbl [2,073]>  
##  7 distance             1158 high       6097842884 <lgl ~ <dbl ~ <dbl [1,158]>  
##  8 distance             8387 high       6091990268 <lgl ~ <dbl ~ <dbl [8,387]>  
##  9 distance             5587 high       6073551706 <lgl ~ <dbl ~ <dbl [5,587]>  
## 10 distance             5281 high       6057232328 <lgl ~ <dbl ~ <dbl [5,281]>  
## # ... with 595 more rows, and 7 more variables: grade_smooth <list>,
## #   distance <list>, altitude <list>, heartrate <list>, time <list>,
## #   cadence <list>, watts <list>
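
Here is the same reshaping step on a tiny made-up data frame. Because the values live in a list column, pivot_wider() produces one list column per measurement type.

```r
library(tidyr)
library(tibble)

# Made-up long data: two measurement series for one activity.
df_long <- tibble(
  type = c("distance", "time"),
  data = list(c(0, 5, 10.9), c(0L, 1L, 2L)),
  id = "6218628649"
)

# One row per activity, one list column per measurement type
df_wide <- pivot_wider(df_long, names_from = type, values_from = data)
names(df_wide)
```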

4. Preprocess and unnest the data

The column latlng needs special attention, because it contains latitude and longitude information. Separate the two measurements before unnesting all list columns.

name	command	pattern	cue_mode
df_meas_pro	meas_pro(df_meas_wide)	NA	thorough

meas_pro <- function(df_meas_wide) {
  df_meas_wide %>%
    mutate(
      lat = map_if(
        .x = latlng,
        .p = ~ !is.null(.x),
        .f = ~ .x[, 1]
      ),
      lng = map_if(
        .x = latlng,
        .p = ~ !is.null(.x),
        .f = ~ .x[, 2]
      )
    ) %>%
    select(-c(latlng, original_size, resolution, series_type)) %>%
    unnest(where(is_list))
}

After this step, every row is one point in time and every column is a measurement at this point in time (if there was any activity at that moment).

## # A tibble: 2,176,926 x 12
##    id      moving velocity_smooth grade_smooth distance altitude heartrate  time
##    <chr>   <lgl>            <dbl>        <dbl>    <dbl>    <dbl>     <dbl> <dbl>
##  1 621862~ FALSE             0             1.8      0       527        149     0
##  2 621862~ TRUE              0             1.2      5       527.       150     1
##  3 621862~ TRUE              0             0.9     10.9     527.       150     2
##  4 621862~ TRUE              5.68          0.8     17       527.       150     3
##  5 621862~ TRUE              5.81          0.8     23.3     527.       150     4
##  6 621862~ TRUE              5.88          0.8     29.4     527.       150     5
##  7 621862~ TRUE              6.13          0.8     35.6     527.       151     6
##  8 621862~ TRUE              6.15          0       41.6     527.       150     7
##  9 621862~ TRUE              6.14          0       47.8     527.       150     8
## 10 621862~ TRUE              6.13          0.8     53.9     527.       150     9
## # ... with 2,176,916 more rows, and 4 more variables: cadence <dbl>,
## #   watts <dbl>, lat <dbl>, lng <dbl>
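
The following sketch repeats the two key moves on a made-up activity with two points: split the latlng matrix into its columns, then unnest the list columns. It is a simplified variant of meas_pro that names the columns to unnest explicitly and assumes that no latlng entry is missing.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Made-up wide data: one activity with two recorded points.
df_wide_toy <- tibble(
  id = "6218628649",
  distance = list(c(0, 5)),
  latlng = list(matrix(c(48.1, 48.2, 9.1, 9.2), ncol = 2))
)

df_points <- df_wide_toy %>%
  mutate(
    lat = map(latlng, ~ .x[, 1]),  # first matrix column: latitude
    lng = map(latlng, ~ .x[, 2])   # second matrix column: longitude
  ) %>%
  select(-latlng) %>%
  unnest(c(distance, lat, lng))

df_points
```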

Create Visualization

Visualize the final data by plotting the GPS coordinates of every activity as a path. Every facet is one activity. Keep the rest of the plot as minimal as possible.

name	command	pattern	cue_mode
gg_meas	vis_meas(df_meas_pro)	NA	thorough

vis_meas <- function(df_meas_pro) {
  df_meas_pro %>%
    filter(!is.na(lat)) %>%
    ggplot(aes(x = lng, y = lat)) +
    geom_path() +
    facet_wrap( ~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank()
    )
}
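
The manifest also contains a target that saves the plot, but its code is not shown in this post. A minimal sketch of such a step could look like the one below. save_route_plot is a hypothetical helper, not the repository's save_gg_meas, and the plot size is an arbitrary assumption. The toy plot uses theme_void() as a shorter way to get a bare plot.

```r
library(ggplot2)

# Hypothetical helper, not the repository's save_gg_meas(): write a plot to
# `path` and return the path.
save_route_plot <- function(plot, path, width = 20, height = 20) {
  ggsave(
    filename = path,
    plot = plot,
    width = width,
    height = height,
    units = "cm"
  )
  path
}

# Made-up route with three points
toy_route <- data.frame(lng = c(9.1, 9.2, 9.3), lat = c(48.1, 48.2, 48.15))
toy_plot <- ggplot(toy_route, aes(x = lng, y = lat)) +
  geom_path() +
  theme_void()

out_file <- save_route_plot(toy_plot, file.path(tempdir(), "toy_route.pdf"))
```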

And there it is: All your Strava data in a few tidy data frames and a nice-looking plot. Future updates to the data shouldn’t take too long, because only measurements from new activities will be downloaded. With all your Strava data up to date, there are a lot of appealing possibilities for further data analyses of your fitness data.

Note from the Editor: Julian’s post neatly breaks down complex tasks, walking readers through the steps as well as rationale of his decisions. His use of the targets package demonstrates how an organized workflow enables replicability and ease. In addition, Julian showcases how the R programming language can fulfill a vision sparked by one’s passions. It is an inspiring example of how we can use R to create something that is informative, beautiful, and personal.

¹ You can edit your R environment by running usethis::edit_r_environ(), saving the keys, and then restarting R.