Outlier Days with R and Python

2020-03-16

by Jonathan Regenstein

Welcome to another installment of Reproducible Finance. Today’s post will be topical as we look at the historical behavior of the stock market after days of extreme returns and it will also explore one of my favorite coding themes of 2020 - the power of RMarkdown as an R/Python collaboration tool.

This post originated when Rishi Singh, the founder of tiingo and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it’s a great opportunity to use RMarkdown's R and Python interoperability superpowers, fueled by the reticulate package. If you are an R coder and someone sends you Python code as part of a project, RMarkdown + reticulate makes it quite smooth to incorporate that Python code into your work. It was interesting to learn how a very experienced Python coder might tackle a problem and then think about how to tackle that problem with R. Unsurprisingly, I couldn’t resist adding a few elements of data visualization.

Before we get started, if you’re unfamiliar with using R and Python chunks throughout an RMarkdown file, have a quick look at the reticulate documentation here.

Let’s get to it. Since we’ll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we’ll also load the reticulate package and source a Python script. Here’s what that looks like.

library(tidyverse)
library(tidyquant)
library(riingo)
library(timetk)
library(plotly)
library(roll)
library(slider)
library(reticulate)

riingo_set_token("your tiingo token here")

# Python file that holds my tiingo token
reticulate::source_python("credentials.py")

knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)

Note that I set my tiingo token twice: first using riingo_set_token() so I can use the riingo package in R chunks and then by sourcing the credentials.py file, where I have put tiingoToken = 'my token'. Now I can use the tiingoToken variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.

Next we will use a Python chunk to load the necessary Python libraries. If you haven’t installed these yet, you can open the RStudio terminal and run pip install. Since we’ll be interspersing R and Python code chunks throughout, I will add a # Python Chunk to each Python chunk and, um, # R Chunk to each R chunk.

# Python chunk
import pandas as pd
import numpy as np
import tiingo

Let’s get to the substance. The goal today is look back at the last 43 years of S&P 500 price history and analyze how the market has performed following a day that sees an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.

We will use the mutual fund VFINX as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like SPY or VOO.

Let’s start by passing a URL string from tiingo to the pandas function read_csv, along with our tiingoToken.

# Python chunk 
pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)

We just created a Python object called pricesDF. We can look at that object in an R chunk by calling py$pricesDF.

# R chunk
py$pricesDF %>% 
  head()

We just created a Python object called pricesDF. Let’s reformat the date column becomes the index, in date time format.

# Python chunk 
pricesDF = pricesDF.set_index(['date'])
pricesDF.index = pd.DatetimeIndex(pricesDF.index)

Heading back to R for viewing, we see that the date column is no longer a column - it is the index of the data frame and in pandas the index is more like a label than a new column. In fact, here’s what happens when call the row names of this data frame.

# R chunk
py$pricesDF %>% 
  head() %>% 
  rownames()

We now have our prices, indexed by date. Let’s convert adjusted closing prices to log returns and save the results in a new column called returns. Note the use of the shift(1) operator here. That is analogous to the lag(..., 1) function in dplyr.

# Python chunk
pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))

Next, we want to calculate the 3-month rolling standard deviation of these daily log returns, and then divide daily returns by the previous rolling 3-month volatility in order to prevent look-ahead error. We can think of this as normalizing today’s return by the previous 3-months’ rolling volatility and will label it as stdDevMove.

# Python chunk
pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std()
pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)

Finally, we eventually want to calculate how the market has performed on the day following a large negative move. To prepare for that, let’s create a column of next day returns using shift(-1).

# Python chunk
pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)

Now, we can filter by the size of the stdDevMove column and the returns column, to isolate days where the standard deviation move was at least 3 and the returns was less than -3%. We use mean() to find the mean next day return following such large events.

# Python chunk
nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()

Finally, let’s loop through and see how the mean next day return changes as we filter on different extreme negative returns or we can call drop tolerances. We will label the drop tolerance as i, set it at -.03 and then run a while loop that decrements down i by .0025 at each pass. In this way we can look at the mean next return following different levels of negative returns.

# Python chunk
i = -.03
while i >= -.0525:
    nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']]
    print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6)))

    i -= .0025

It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.

Let’s reproduce these results in R.

First, we import prices using the riingo_prices() function from the riingo.

# R chunk
sp_500_prices <- 
  "VFINX" %>% 
  riingo_prices(start_date = "1976-01-01", end_date = today())

We can use mutate() to add a column of daily returns, rolling volatility, standard deviation move and next day returns.

# R chunk
sp_500_returns <- 
  sp_500_prices %>%
  select(date, adjClose) %>% 
  mutate(daily_returns_log = log(adjClose/lag(adjClose)),
         rolling_vol = roll_sd(as.matrix(daily_returns_log), 63),
         sd_move = daily_returns_log/lag(rolling_vol),
         next_day_returns = lead(daily_returns_log))

Now let’s filter() on an sd_move greater than 3 and daily_returns_log less than a drop tolerance of -.03.

# R chunk

sp_500_returns %>% 
  na.omit() %>% 
  filter(sd_move < -3 & daily_returns_log < -.03) %>% 
  select(date, daily_returns_log, sd_move, next_day_returns) %>% 
  summarise(mean_return = mean(next_day_returns)) %>% 
  add_column(drop_tolerance = scales::percent(.03), .before = 1)

# A tibble: 1 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625

We used a while() loop to iterate across different drop tolerances in Python, let’s see how to implement that using map_dfr() from the purrr package.

First, we will define a sequence of drop tolerances using the seq() function.

# R chunk
drop_tolerance <- seq(.03, .05, .0025)

drop_tolerance

[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500

Next, we will create a function called outlier_mov_fun that takes a data frame of returns, filters on a drop tolerance and gives us the mean return following large negative moves.

# R chunk
outlier_mov_fun <- function(drop_tolerance, returns) {
returns %>%  
  na.omit() %>% 
  filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>% 
  select(date, daily_returns_log, sd_move, next_day_returns) %>% 
  summarise(mean_return = mean(next_day_returns) %>% round(6)) %>%
  add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>% 
  add_column(drop_tolerance_raw = drop_tolerance, .before = 1)
}

Notice how that function takes two arguments: a drop tolerance and data frame of returns.

Next, we pass our sequence of drop tolerances, stored in a variable called drop_tolerance to map_dfr(), along with our function and our sp_500_returns object. map_dfr will iterate through our sequence of drops and apply our function to each one.

# R chunk
map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>% 
  select(-drop_tolerance_raw)

# A tibble: 9 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625
2 3%                 0.00700
3 3%                 0.00967
4 4%                 0.0109 
5 4%                 0.0122 
6 4%                 0.0132 
7 4%                 0.0149 
8 5%                 0.0149 
9 5%                 0.0162

Have a quick glance up that the results of our Python while() and we should see that the results are consistent.

Alright, let’s have some fun and get to visualizing these results with ggplot and plotly.

# R chunk
(
sp_500_returns %>% 
map_dfr(drop_tolerance, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
                                                                      mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()
) %>% ggplotly(tooltip = "text")

Here’s what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.

# R chunk
drop_tolerance_2 <- seq(.02, .05, .00125)

(
sp_500_returns %>% 
map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
                                                                      mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()
) %>% ggplotly(tooltip = "text")

Check out what happens when we expand the lower bound, to a -6% drop tolerance.

# R chunk
drop_tolerance_3 <- seq(.02, .06, .00125)

(
sp_500_returns %>% 
map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>% 
  ggplot(aes(x = drop_tolerance_raw, y = mean_return, text = str_glue("drop tolerance: {drop_tolerance}
                                                                      mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + 
  theme_minimal()
) %>% ggplotly(tooltip = "text")

I did not expect that gap upward when the daily drop passes 5.25%.

A quick addendum that if I had gotten my act together and finished this 4 days ago I would not have included, but I’m curious how this last week has compared with other weeks in terms of volatility. I have in mind to visualize weekly return dispersion and that seemed a mighty tall task, until the brand new slider package came to the rescue! slider has a function called slide_period() that, among other things, allows us to break up time series according to different periodicities.

To break up our returns by week, we call slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week"), where first_monday_december is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that’s a Monday or we could use some good ol’ code. Let’s assume we want to find the first Monday in December of 2016.

We first filter our data with filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))). Then create a column of weekday names with wday(date, label = TRUE, abbr = FALSE) and filter to our first value of “Monday”.

# R Chunk
first_monday_december <- 
  sp_500_returns %>%
  mutate(date = ymd(date)) %>% 
  filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>% 
  mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>% 
  filter(day_week == "Monday") %>% 
  slice(1) %>% 
  pull(date)

Now we run our slide_period_dfr() code and it will start on the first Monday in December of 2016, and break our returns into weeks. Since we set .names_to = "week", the function will create a new column called week and give a unique number to each of our weeks.

# R chunk

sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>% 
  head(10)

# A tibble: 10 x 3
    week date                daily_returns_log
   <int> <dttm>                          <dbl>
 1     1 2016-12-05 00:00:00           0.00589
 2     1 2016-12-06 00:00:00           0.00342
 3     1 2016-12-07 00:00:00           0.0133 
 4     1 2016-12-08 00:00:00           0.00226
 5     1 2016-12-09 00:00:00           0.00589
 6     2 2016-12-12 00:00:00          -0.00105
 7     2 2016-12-13 00:00:00           0.00667
 8     2 2016-12-14 00:00:00          -0.00810
 9     2 2016-12-15 00:00:00           0.00392
10     2 2016-12-16 00:00:00          -0.00172

From here, we can group_by that week column and treat each week as a discrete time period. Let’s use ggplotly to plot each week on the x-axis and the daily returns of each week on the y-axis, so that the vertical dispersion shows us the dispersion of weekly returns. Hover on the point to see the exact date of the return.

# R chunk
(
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>%
  group_by(week) %>%
  mutate(start_week = ymd(min(date))) %>%
  ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) +
  geom_point(color = "cornflowerblue", alpha = .5) +
  scale_y_continuous(labels = scales::percent,
                     breaks = scales::pretty_breaks(n = 8)) +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  labs(y = "", x = "", title = "Weekly Daily Returns") +
  theme_minimal()
) %>% ggplotly(tooltip = "text")

We can also plot the standard deviation of returns for each week.

# R chunk

(
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(.,
                   .$date,
                   "week",
                   ~ .x,
                   .origin = first_monday_december,
                   .names_to = "week") %>%
  group_by(week) %>%
  summarise(first_of_week = first(date),
            sd = sd(daily_returns_log)) %>%
  ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) +
  geom_point(aes(color = sd)) +
  labs(x = "", title = "Weekly Standard Dev of Returns", y = "") +
  theme_minimal()
) %>% ggplotly(tooltip = "text")

That’s all for today! Thanks for reading and stay safe out there.