Today, we move beyond CAPM’s simple linear regression and explore the Fama French (FF) multi-factor model of equity risk/return. For more background, have a look at the original article published in The Journal Financial Economics, Common risk factors in the returns on stocks and bonds.
The FF model extends CAPM by regressing portfolio returns on several variables, in addition to market returns. From a general data science point of view, FF extends CAPM’s simple linear regression, where we had one independent variable, to a multiple linear regression, where we have numerous independent variables.
We are going to look at the FF 3-factor model, which tests the explanatory power of (1) market returns (same as CAPM), (2) firm size (small versus big) and (3) firm value (book to market ratio). The
firm value factor is labeled as
HML in FF, which stands for high-minus-low and refers to a firm’s book-to-market ratio. When we regress portfolio returns on the
HML factor, we are investigating how much of the returns are the result of including stocks with a high book-to-market ratio (sometimes called the
value premium, because high book-to-market stocks are called value stocks).
A large portion of this post covers importing data from the FF website and wrangling it for use with our portfolio returns. We will see that wrangling the data is conceptually easy to understand but practically time-consuming to implement. However, mashing together data from disparate sources is a necessary skill for anyone in industry that has data streams from different vendors and wants to get creative about how to use them. Once the data is wrangled, fitting the model is not time-consuming.
Today, we will be working with our usual portfolio consisting of:
+ SPY (S&P500 fund) weighted 25% + EFA (a non-US equities fund) weighted 25% + IJS (a small-cap value fund) weighted 20% + EEM (an emerging-mkts fund) weighted 20% + AGG (a bond fund) weighted 10%
Before we can calculate beta for that portfolio, we need to find portfolio monthly returns, which was covered in my previous post, Introduction to Portfolio Returns. I won’t go through the logic again, but the code is here:
library(tidyquant) library(tidyverse) library(timetk) library(broom) library(glue) symbols <- c("SPY","EFA", "IJS", "EEM","AGG") prices <- getSymbols(symbols, src = 'yahoo', from = "2012-12-31", to = "2017-12-31", auto.assign = TRUE, warnings = FALSE) %>% map(~Ad(get(.))) %>% reduce(merge) %>% `colnames<-`(symbols) w <- c(0.25, 0.25, 0.20, 0.20, 0.10) asset_returns_long <- prices %>% to.monthly(indexAt = "lastof", OHLC = FALSE) %>% tk_tbl(preserve_index = TRUE, rename_index = "date") %>% gather(asset, returns, -date) %>% group_by(asset) %>% mutate(returns = (log(returns) - log(lag(returns)))) %>% na.omit() portfolio_returns_tq_rebalanced_monthly <- asset_returns_long %>% tq_portfolio(assets_col = asset, returns_col = returns, weights = w, col_rename = "returns", rebalance_on = "months")
We will be working with one object of portfolio returns:
Let’s get to it.
Importing and Wrangling the Fama French Factors
Our first task is to get the FF data and, fortunately, FF make their factor data available on the internet. We will document each step for importing and cleaning this data, to an extent that might be overkill. Frustrating for us now, but a time-saver later when we need to update this model or extend to the 5-factor case.
Have a look at the FF website. The data are packaged as zip files, so we will need to do a bit more than call
read_csv(). Let’s use the
tempfile() function from base R to create a variable called
temp. This is where we will put the zipped file.
temp <- tempfile()
R has created a temporary file called
temp that will be cleaned up when we exit this session. Download 3-factor zip with this link. We want to pass that to
download.file() and store the result in
First, though, we will break that string into three pieces:
format - this is not necessary for today’s task, but it will come in handy if we want to build a Shiny application to let a user choose a factor from the FF website, or if we just want to re-run this analysis with a different set of FF factors. We will then
glue() those together and save the string as
base <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/" factor <- "Global_3_Factors" format<- "_CSV.zip" full_url <- glue(base, factor, format, sep ="")
Now we pass
download.file( full_url, temp, quiet = TRUE)
Finally, we can read the csv file using
read_csv() after unzipping that data with the
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv")) head(Global_3_Factors)
## # A tibble: 6 x 1 ## `This file was created using the 201802 Bloomberg database.` ## <chr> ## 1 Missing data are indicated by -99.99. ## 2 <NA> ## 3 199007 ## 4 199008 ## 5 199009 ## 6 199010
We have imported the dataset, but we do not see any factors, just a column with weirdly formatted dates
When this occurs, it often can be fixed by skipping a certain number of rows that contain metadata. Have a look at what happens if we skip 6 rows.
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6) head(Global_3_Factors)
## # A tibble: 6 x 5 ## X1 `Mkt-RF` SMB HML RF ## <chr> <chr> <chr> <chr> <chr> ## 1 199007 0.86 0.77 -0.25 0.68 ## 2 199008 -10.82 -1.60 0.60 0.66 ## 3 199009 -11.98 1.23 0.81 0.60 ## 4 199010 9.57 -7.39 -4.25 0.68 ## 5 199011 -3.86 1.22 1.14 0.57 ## 6 199012 1.10 -0.79 -1.60 0.60
This is what were were expecting, 5 columns: one called
X1 that holds the weirdly formatted dates, then
Mkt-Rf for the market returns above the risk-free rate,
SMB for the size factor,
HML for the value factor, and
RF for the risk-free rate.
However, the data have been coerced to a character format - look at the class of each column:
## $X1 ##  "character" ## ## $`Mkt-RF` ##  "character" ## ## $SMB ##  "character" ## ## $HML ##  "character" ## ## $RF ##  "character"
We have two options for coercing those columns to the right format. First, we can do so upon import, by supplying the argument
col_types = cols(col_name = col_double(),... for each numeric column.
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6, col_types = cols( `Mkt-RF` = col_double(), SMB = col_double(), HML = col_double(), RF = col_double())) head(Global_3_Factors)
## # A tibble: 6 x 5 ## X1 `Mkt-RF` SMB HML RF ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 199007 0.860 0.770 -0.250 0.680 ## 2 199008 -10.8 -1.60 0.600 0.660 ## 3 199009 -12.0 1.23 0.810 0.600 ## 4 199010 9.57 -7.39 -4.25 0.680 ## 5 199011 -3.86 1.22 1.14 0.570 ## 6 199012 1.10 -0.790 -1.60 0.600
That works well, but it’s specific to the FF 3-factor set with those specific column names. If we imported a different FF factor set, we would need to specify different column names.
As an alternate approach, the code chunk below converts the columns to numeric after import, but is more general. It can be applied to other FF factor collections.
To do this, we rename the
X1 column to
date, and then use the
mutate_at(vars(-date), as.numeric) to change our column formats to numeric. The
vars() function operates like the
select() function in that we can tell it to operate on all columns except the
date column by putting a negative sign in front of
date. This column coercion flow is more flexible in that it would work for different FF factor sets.
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6) %>% rename(date = X1) %>% mutate_at(vars(-date), as.numeric) head(Global_3_Factors)
## # A tibble: 6 x 5 ## date `Mkt-RF` SMB HML RF ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 199007 0.860 0.770 -0.250 0.680 ## 2 199008 -10.8 -1.60 0.600 0.660 ## 3 199009 -12.0 1.23 0.810 0.600 ## 4 199010 9.57 -7.39 -4.25 0.680 ## 5 199011 -3.86 1.22 1.14 0.570 ## 6 199012 1.10 -0.790 -1.60 0.600
We now have numeric data for our factors and the date column has a better label, but the wrong format.
We can use the
lubridate package to parse that date string into a nicer date format. We will use the
parse_date_time() function, and call the
ymd() function to make sure the end result is in a date format. Again, when working with data from a new source, the date and, indeed, any column can come in many formats.
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6) %>% rename(date = X1) %>% mutate_at(vars(-date), as.numeric) %>% mutate(date = ymd(parse_date_time(date, "%Y%m"))) head(Global_3_Factors)
## # A tibble: 6 x 5 ## date `Mkt-RF` SMB HML RF ## <date> <dbl> <dbl> <dbl> <dbl> ## 1 1990-07-01 0.860 0.770 -0.250 0.680 ## 2 1990-08-01 -10.8 -1.60 0.600 0.660 ## 3 1990-09-01 -12.0 1.23 0.810 0.600 ## 4 1990-10-01 9.57 -7.39 -4.25 0.680 ## 5 1990-11-01 -3.86 1.22 1.14 0.570 ## 6 1990-12-01 1.10 -0.790 -1.60 0.600
The date format looks good, and that matters because we want to trim the factor data that the FF dates match our portfolio dates. However, notice that FF uses the first of the month and our portfolio returns use the last of the month. Again, it’s
lubridate to the rescue with the
rollback() function. This will roll monthly dates back to the last day of the previous month. The first date in our FF data is “1990-07-01”. Let’s roll it back.
Global_3_Factors %>% select(date) %>% mutate(date = lubridate::rollback(date)) %>% head(1)
## # A tibble: 1 x 1 ## date ## <date> ## 1 1990-06-30
If we want to reset our dates to the last of the month, we need to add one first, then rollback.
Global_3_Factors %>% select(date) %>% mutate(date = lubridate::rollback(date + months(1))) %>% head(1)
## # A tibble: 1 x 1 ## date ## <date> ## 1 1990-07-31
There are other ways we could have gotten around this issue - most notably, way back in the beginning, we could have indexed our portfolio returns to
indexAt = firstof - but it was a good chance to introduce the
rollback() function, and we will not always have that option. Sometimes two data sets are thrown at us and we have to wrangle them from there.
Finally, we want only the FF factor data that aligns with our portfolio data, so we
filter() by the
last() date in our portfolio returns object.
Global_3_Factors <- read_csv(unz(temp, "Global_3_Factors.csv"), skip = 6) %>% rename(date = X1) %>% mutate_at(vars(-date), as.numeric) %>% mutate(date = rollback(ymd(parse_date_time(date, "%Y%m") + months(1)))) %>% filter(date >= first(portfolio_returns_tq_rebalanced_monthly$date) & date <= last(portfolio_returns_tq_rebalanced_monthly$date)) head(Global_3_Factors, 3)
## # A tibble: 3 x 5 ## date `Mkt-RF` SMB HML RF ## <date> <dbl> <dbl> <dbl> <dbl> ## 1 2013-01-31 5.46 0.140 2.01 0. ## 2 2013-02-28 0.100 0.330 -0.780 0. ## 3 2013-03-31 2.29 0.830 -2.03 0.
## # A tibble: 3 x 5 ## date `Mkt-RF` SMB HML RF ## <date> <dbl> <dbl> <dbl> <dbl> ## 1 2017-10-31 1.80 -0.850 -0.950 0.0900 ## 2 2017-11-30 1.93 -0.680 -0.260 0.0800 ## 3 2017-12-31 1.38 0.940 0.140 0.0900
All that work enables us to merge these data objects together with
left_join(...by = "date"). We also convert the FF data to decimal and create a new column called
R_excess to hold our returns above the risk-free rate.
ff_portfolio_returns <- portfolio_returns_tq_rebalanced_monthly %>% left_join(Global_3_Factors, by = "date") %>% mutate(MKT_RF = Global_3_Factors$`Mkt-RF`/100, SMB = Global_3_Factors$SMB/100, HML = Global_3_Factors$HML/100, RF = Global_3_Factors$RF/100, R_excess = round(returns - RF, 4)) head(ff_portfolio_returns, 4)
## # A tibble: 4 x 8 ## date returns `Mkt-RF` SMB HML RF MKT_RF R_excess ## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013-01-31 0.0308 5.46 0.00140 0.0201 0. 0.0546 0.0308 ## 2 2013-02-28 -0.000870 0.100 0.00330 -0.00780 0. 0.00100 -0.000900 ## 3 2013-03-31 0.0187 2.29 0.00830 -0.0203 0. 0.0229 0.0187 ## 4 2013-04-30 0.0206 3.02 -0.0121 0.00960 0. 0.0302 0.0206
We now we have one object with our portfolio returns and FF factors, and can proceed to the simplest part of our exercise from a coding perspective, and the only part that our bosses/colleagues/clients/investors will care about: the modeling and visualization.
Luckily, we can copy/paste the flow from our CAPM work, now that we have the data in a nice format. CAPM used simple linear regression, whereas FF uses multiple regression with many independent variables. Accordingly, our 3-factor FF equation is
lm(R_excess ~ MKT_RF + SMB + HML.
We will make one addition to the CAPM code flow, which is to include the 95% confidence interval for our coefficients. We do that by setting
tidy(model, conf.int = T, conf.level = .95).
ff_dplyr_byhand <- ff_portfolio_returns %>% do(model = lm(R_excess ~ MKT_RF + SMB + HML, data = .)) %>% tidy(model, conf.int = T, conf.level = .95) ff_dplyr_byhand %>% mutate_if(is.numeric, funs(round(., 3))) %>% select(-statistic)
## term estimate std.error p.value conf.low conf.high ## 1 (Intercept) -0.001 0.001 0.195 -0.004 0.001 ## 2 MKT_RF 0.894 0.036 0.000 0.823 0.965 ## 3 SMB 0.056 0.076 0.458 -0.095 0.208 ## 4 HML 0.030 0.061 0.629 -0.092 0.151
Our model object now contains a
conf.low column to hold our confidence interval min and max values.
We can pipe these results to
ggplot() and create a scatter of coefficients with confidence intervals. I don’t want to plot the intercept so will filter it out of the code flow.
We add the confidence intervals with
geom_errorbar(aes(ymin = conf.low, ymax = conf.high)).
ff_dplyr_byhand %>% mutate_if(is.numeric, funs(round(., 3))) %>% filter(term != "(Intercept)") %>% ggplot(aes(x = term, y = estimate, shape = term, color = term)) + geom_point() + geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) + labs(title = "FF 3-Factor Coefficients for Our Portfolio", subtitle = "nothing in this post is investment advice", x = "", y = "coefficient", caption = "data source: Fama French website and yahoo! Finance") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0))
The results here are predictable because, as with CAPM, we are regressing a portfolio that contains the market on 3 factors, one of which is the market. Thus, the market factor dominates this model and the other two factors contain zero in their confidence bands.
Next time we will complicate this work by calculating FF factor coefficients for multiple funds/portfolios, examining rolling R squared’s and visualizing the results. See you then!