Introduction to Fama French

2018-04-11

by Jonathan Regenstein

In two previous posts, we calculated and then visualized the CAPM beta of a portfolio by fitting a simple linear model.

Today, we move beyond CAPM’s simple linear regression and explore the Fama French (FF) multi-factor model of equity risk/return. For more background, have a look at the original article published in The Journal Financial Economics, Common risk factors in the returns on stocks and bonds.

The FF model extends CAPM by regressing portfolio returns on several variables, in addition to market returns. From a general data science point of view, FF extends CAPM’s simple linear regression, where we had one independent variable, to a multiple linear regression, where we have numerous independent variables.

We are going to look at the FF 3-factor model, which tests the explanatory power of (1) market returns (same as CAPM), (2) firm size (small versus big) and (3) firm value (book to market ratio). The firm value factor is labeled as HML in FF, which stands for high-minus-low and refers to a firm’s book-to-market ratio. When we regress portfolio returns on the HML factor, we are investigating how much of the returns are the result of including stocks with a high book-to-market ratio (sometimes called the value premium, because high book-to-market stocks are called value stocks).

A large portion of this post covers importing data from the FF website and wrangling it for use with our portfolio returns. We will see that wrangling the data is conceptually easy to understand but practically time-consuming to implement. However, mashing together data from disparate sources is a necessary skill for anyone in industry that has data streams from different vendors and wants to get creative about how to use them. Once the data is wrangled, fitting the model is not time-consuming.

Today, we will be working with our usual portfolio consisting of:

+ SPY (S&P500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%

Before we can calculate beta for that portfolio, we need to find portfolio monthly returns, which was covered in my previous post, Introduction to Portfolio Returns. I won’t go through the logic again, but the code is here:

library(tidyquant)
library(tidyverse)
library(timetk)
library(broom)
library(glue)

symbols <- c("SPY","EFA", "IJS", "EEM","AGG")

prices <- 
  getSymbols(symbols, src = 'yahoo', 
             from = "2012-12-31",
             to = "2017-12-31",
             auto.assign = TRUE, warnings = FALSE) %>% 
  map(~Ad(get(.))) %>%
  reduce(merge) %>% 
  `colnames<-`(symbols)

w <- c(0.25, 0.25, 0.20, 0.20, 0.10)

asset_returns_long <-  
  prices %>% 
  to.monthly(indexAt = "lastof", OHLC = FALSE) %>% 
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>% 
  group_by(asset) %>%  
  mutate(returns = (log(returns) - log(lag(returns)))) %>% 
  na.omit()

portfolio_returns_tq_rebalanced_monthly <- 
  asset_returns_long %>%
  tq_portfolio(assets_col  = asset, 
               returns_col = returns,
               weights     = w,
               col_rename  = "returns",
               rebalance_on = "months")

We will be working with one object of portfolio returns:

+ portfolio_returns_tq_rebalanced_monthly

Let’s get to it.

Importing and Wrangling the Fama French Factors

Our first task is to get the FF data and, fortunately, FF make their factor data available on the internet. We will document each step for importing and cleaning this data, to an extent that might be overkill. Frustrating for us now, but a time-saver later when we need to update this model or extend to the 5-factor case.

Have a look at the FF website. The data are packaged as zip files, so we will need to do a bit more than call read_csv(). Let’s use the tempfile() function from base R to create a variable called temp. This is where we will put the zipped file.

temp <- tempfile()

R has created a temporary file called temp that will be cleaned up when we exit this session. Download 3-factor zip with this link. We want to pass that to download.file() and store the result in temp.

First, though, we will break that string into three pieces: base, factor and format - this is not necessary for today’s task, but it will come in handy if we want to build a Shiny application to let a user choose a factor from the FF website, or if we just want to re-run this analysis with a different set of FF factors. We will then glue() those together and save the string as full_url.

base <- 
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"

factor <- 
  "Global_3_Factors"

format<-
  "_CSV.zip"

full_url <-
  glue(base,
        factor,
        format,
        sep ="")

Now we pass full_url to download.file().

download.file(
full_url,
temp,
quiet = TRUE)

Finally, we can read the csv file using read_csv() after unzipping that data with the unz() function.

Global_3_Factors <- 
  read_csv(unz(temp, 
               "Global_3_Factors.csv"))

head(Global_3_Factors)

## # A tibble: 6 x 1
##   `This file was created using the 201802 Bloomberg database.`
##   <chr>                                                       
## 1 Missing data are indicated by -99.99.                       
## 2 <NA>                                                        
## 3 199007                                                      
## 4 199008                                                      
## 5 199009                                                      
## 6 199010

We have imported the dataset, but we do not see any factors, just a column with weirdly formatted dates

When this occurs, it often can be fixed by skipping a certain number of rows that contain metadata. Have a look at what happens if we skip 6 rows.

Global_3_Factors <- 
  read_csv(unz(temp, 
               "Global_3_Factors.csv"), 
    skip = 6)

head(Global_3_Factors)

## # A tibble: 6 x 5
##   X1     `Mkt-RF` SMB   HML   RF   
##   <chr>  <chr>    <chr> <chr> <chr>
## 1 199007 0.86     0.77  -0.25 0.68 
## 2 199008 -10.82   -1.60 0.60  0.66 
## 3 199009 -11.98   1.23  0.81  0.60 
## 4 199010 9.57     -7.39 -4.25 0.68 
## 5 199011 -3.86    1.22  1.14  0.57 
## 6 199012 1.10     -0.79 -1.60 0.60

This is what were were expecting, 5 columns: one called X1 that holds the weirdly formatted dates, then Mkt-Rf for the market returns above the risk-free rate, SMB for the size factor, HML for the value factor, and RF for the risk-free rate.

However, the data have been coerced to a character format - look at the class of each column:

map(Global_3_Factors, class)

## $X1
## [1] "character"
## 
## $`Mkt-RF`
## [1] "character"
## 
## $SMB
## [1] "character"
## 
## $HML
## [1] "character"
## 
## $RF
## [1] "character"

We have two options for coercing those columns to the right format. First, we can do so upon import, by supplying the argument col_types = cols(col_name = col_double(),... for each numeric column.

Global_3_Factors <- 
  read_csv(unz(temp, 
               "Global_3_Factors.csv"), 
           skip = 6,
           col_types = cols(
             `Mkt-RF` = col_double(),
             SMB = col_double(),
             HML = col_double(),
             RF = col_double()))

head(Global_3_Factors)

## # A tibble: 6 x 5
##   X1     `Mkt-RF`    SMB    HML    RF
##   <chr>     <dbl>  <dbl>  <dbl> <dbl>
## 1 199007    0.860  0.770 -0.250 0.680
## 2 199008  -10.8   -1.60   0.600 0.660
## 3 199009  -12.0    1.23   0.810 0.600
## 4 199010    9.57  -7.39  -4.25  0.680
## 5 199011   -3.86   1.22   1.14  0.570
## 6 199012    1.10  -0.790 -1.60  0.600

That works well, but it’s specific to the FF 3-factor set with those specific column names. If we imported a different FF factor set, we would need to specify different column names.

As an alternate approach, the code chunk below converts the columns to numeric after import, but is more general. It can be applied to other FF factor collections.

To do this, we rename the X1 column to date, and then use the dplyr verb mutate_at(vars(-date), as.numeric) to change our column formats to numeric. The vars() function operates like the select() function in that we can tell it to operate on all columns except the date column by putting a negative sign in front of date. This column coercion flow is more flexible in that it would work for different FF factor sets.

Global_3_Factors <- 
  read_csv(unz(temp, 
               "Global_3_Factors.csv"), 
           skip = 6) %>%
  rename(date = X1) %>% 
  mutate_at(vars(-date), as.numeric)

head(Global_3_Factors)

## # A tibble: 6 x 5
##   date   `Mkt-RF`    SMB    HML    RF
##   <chr>     <dbl>  <dbl>  <dbl> <dbl>
## 1 199007    0.860  0.770 -0.250 0.680
## 2 199008  -10.8   -1.60   0.600 0.660
## 3 199009  -12.0    1.23   0.810 0.600
## 4 199010    9.57  -7.39  -4.25  0.680
## 5 199011   -3.86   1.22   1.14  0.570
## 6 199012    1.10  -0.790 -1.60  0.600

We now have numeric data for our factors and the date column has a better label, but the wrong format.

We can use the lubridate package to parse that date string into a nicer date format. We will use the parse_date_time() function, and call the ymd() function to make sure the end result is in a date format. Again, when working with data from a new source, the date and, indeed, any column can come in many formats.

Global_3_Factors <- 
  read_csv(unz(temp, "Global_3_Factors.csv"), 
           skip = 6) %>%
  rename(date = X1) %>% 
  mutate_at(vars(-date), as.numeric) %>% 
  mutate(date = 
           ymd(parse_date_time(date, "%Y%m"))) 

head(Global_3_Factors)

## # A tibble: 6 x 5
##   date       `Mkt-RF`    SMB    HML    RF
##   <date>        <dbl>  <dbl>  <dbl> <dbl>
## 1 1990-07-01    0.860  0.770 -0.250 0.680
## 2 1990-08-01  -10.8   -1.60   0.600 0.660
## 3 1990-09-01  -12.0    1.23   0.810 0.600
## 4 1990-10-01    9.57  -7.39  -4.25  0.680
## 5 1990-11-01   -3.86   1.22   1.14  0.570
## 6 1990-12-01    1.10  -0.790 -1.60  0.600

The date format looks good, and that matters because we want to trim the factor data that the FF dates match our portfolio dates. However, notice that FF uses the first of the month and our portfolio returns use the last of the month. Again, it’s lubridate to the rescue with the rollback() function. This will roll monthly dates back to the last day of the previous month. The first date in our FF data is “1990-07-01”. Let’s roll it back.

Global_3_Factors %>% 
  select(date) %>%
  mutate(date = lubridate::rollback(date)) %>% 
  head(1)

## # A tibble: 1 x 1
##   date      
##   <date>    
## 1 1990-06-30

If we want to reset our dates to the last of the month, we need to add one first, then rollback.

Global_3_Factors %>% 
  select(date) %>%
  mutate(date = lubridate::rollback(date + months(1))) %>% 
  head(1)

## # A tibble: 1 x 1
##   date      
##   <date>    
## 1 1990-07-31

There are other ways we could have gotten around this issue - most notably, way back in the beginning, we could have indexed our portfolio returns to indexAt = firstof - but it was a good chance to introduce the rollback() function, and we will not always have that option. Sometimes two data sets are thrown at us and we have to wrangle them from there.

Finally, we want only the FF factor data that aligns with our portfolio data, so we filter() by the first() and last() date in our portfolio returns object.

Global_3_Factors <- 
  read_csv(unz(temp, "Global_3_Factors.csv"), 
           skip = 6) %>% 
  rename(date = X1) %>% 
  mutate_at(vars(-date), as.numeric) %>% 
  mutate(date = 
           rollback(ymd(parse_date_time(date, "%Y%m") + months(1)))) %>% 
  filter(date >= 
   first(portfolio_returns_tq_rebalanced_monthly$date) & date <= 
   last(portfolio_returns_tq_rebalanced_monthly$date))

head(Global_3_Factors, 3)

## # A tibble: 3 x 5
##   date       `Mkt-RF`   SMB    HML    RF
##   <date>        <dbl> <dbl>  <dbl> <dbl>
## 1 2013-01-31    5.46  0.140  2.01     0.
## 2 2013-02-28    0.100 0.330 -0.780    0.
## 3 2013-03-31    2.29  0.830 -2.03     0.

tail(Global_3_Factors, 3)

## # A tibble: 3 x 5
##   date       `Mkt-RF`    SMB    HML     RF
##   <date>        <dbl>  <dbl>  <dbl>  <dbl>
## 1 2017-10-31     1.80 -0.850 -0.950 0.0900
## 2 2017-11-30     1.93 -0.680 -0.260 0.0800
## 3 2017-12-31     1.38  0.940  0.140 0.0900

All that work enables us to merge these data objects together with left_join(...by = "date"). We also convert the FF data to decimal and create a new column called R_excess to hold our returns above the risk-free rate.

ff_portfolio_returns <- 
  portfolio_returns_tq_rebalanced_monthly %>% 
  left_join(Global_3_Factors, by = "date") %>% 
  mutate(MKT_RF = Global_3_Factors$`Mkt-RF`/100,
         SMB = Global_3_Factors$SMB/100,
         HML = Global_3_Factors$HML/100,
         RF = Global_3_Factors$RF/100,
         R_excess = round(returns - RF, 4))


head(ff_portfolio_returns, 4)

## # A tibble: 4 x 8
##   date         returns `Mkt-RF`      SMB      HML    RF  MKT_RF  R_excess
##   <date>         <dbl>    <dbl>    <dbl>    <dbl> <dbl>   <dbl>     <dbl>
## 1 2013-01-31  0.0308      5.46   0.00140  0.0201     0. 0.0546   0.0308  
## 2 2013-02-28 -0.000870    0.100  0.00330 -0.00780    0. 0.00100 -0.000900
## 3 2013-03-31  0.0187      2.29   0.00830 -0.0203     0. 0.0229   0.0187  
## 4 2013-04-30  0.0206      3.02  -0.0121   0.00960    0. 0.0302   0.0206

We now we have one object with our portfolio returns and FF factors, and can proceed to the simplest part of our exercise from a coding perspective, and the only part that our bosses/colleagues/clients/investors will care about: the modeling and visualization.

Luckily, we can copy/paste the flow from our CAPM work, now that we have the data in a nice format. CAPM used simple linear regression, whereas FF uses multiple regression with many independent variables. Accordingly, our 3-factor FF equation is lm(R_excess ~ MKT_RF + SMB + HML.

We will make one addition to the CAPM code flow, which is to include the 95% confidence interval for our coefficients. We do that by setting tidy(model, conf.int = T, conf.level = .95).

ff_dplyr_byhand <-
  ff_portfolio_returns %>% 
  do(model = 
       lm(R_excess ~ MKT_RF + SMB + HML, 
                data = .)) %>% 
  tidy(model, conf.int = T, conf.level = .95)

ff_dplyr_byhand %>% 
  mutate_if(is.numeric, funs(round(., 3))) %>% 
  select(-statistic)

##          term estimate std.error p.value conf.low conf.high
## 1 (Intercept)   -0.001     0.001   0.195   -0.004     0.001
## 2      MKT_RF    0.894     0.036   0.000    0.823     0.965
## 3         SMB    0.056     0.076   0.458   -0.095     0.208
## 4         HML    0.030     0.061   0.629   -0.092     0.151

Our model object now contains a conf.high and conf.low column to hold our confidence interval min and max values.

We can pipe these results to ggplot() and create a scatter of coefficients with confidence intervals. I don’t want to plot the intercept so will filter it out of the code flow.

We add the confidence intervals with geom_errorbar(aes(ymin = conf.low, ymax = conf.high)).

ff_dplyr_byhand %>% 
  mutate_if(is.numeric, funs(round(., 3))) %>%
  filter(term != "(Intercept)") %>% 
  ggplot(aes(x = term, y = estimate, shape = term, color = term)) + 
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  labs(title = "FF 3-Factor Coefficients for Our Portfolio",
       subtitle = "nothing in this post is investment advice",
       x = "",
       y = "coefficient",
       caption = "data source: Fama French website and yahoo! Finance") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption  = element_text(hjust = 0))

The results here are predictable because, as with CAPM, we are regressing a portfolio that contains the market on 3 factors, one of which is the market. Thus, the market factor dominates this model and the other two factors contain zero in their confidence bands.

Next time we will complicate this work by calculating FF factor coefficients for multiple funds/portfolios, examining rolling R squared’s and visualizing the results. See you then!