Many Factor Models

Today, we will return to the Fama French (FF) model of asset returns and use it as a proxy for fitting and evaluating multiple linear models. In a previous post, we reviewed how to run the FF three-factor model on the returns of a portfolio. That is, we ran one model on one set of returns. Today, we will run multiple models on multiple streams of returns, which will allow us to compare those models and hopefully build a code scaffolding that can be used when we wish to explore other factor models.

Read more

Share Comments · · ·

A Mathematician's Perspective on Topological Data Analysis and R

A few years ago, when I first became aware of Topological Data Analysis (TDA), I was really excited by the possibility that the elegant theorems of Algebraic Topology could provide some new insights into the practical problems of data analysis. But time has passed, and the sober assessment of Larry Wasserman seems to describe where things stand. TDA is an exciting area and is full of interesting ideas. But so far, it has had little impact on data analysis.

Read more

Share Comments · · ·

In-database xgboost predictions with R

Moving predictive machine learning algorithms into large-scale production environments can present many challenges. For example, problems arise when attempting to calculate prediction probabilities (“scores”) for many thousands of subjects using many thousands of features located on remote databases. xgboost (docs), a popular algorithm for classification and regression, and the model of choice in many winning Kaggle competitions, is no exception. However, to run xgboost, the subject-features matrix must be loaded into memory, a cumbersome and expensive process.

Read more

Share Comments · · · ·

Communicating results with R Markdown

In my training as a consultant, I learned that long hours of analysis were typically followed by equally long hours of preparing for presentations. I had to turn my complex analyses into recommendations, and my success as a consultant depended on my ability to influence decision makers. I used a variety of tools to convey my insights, but over time I increasingly came to rely on R Markdown as my tool of choice.

Read more

Share Comments · · ·

Reproducible Finance, the book! And a discount for our readers

I’m thrilled to announce the release of my new book Reproducible Finance with R: Code Flows and Shiny Apps for Portfolio Analysis, which originated as a series of R Views posts in this space. The first post was written way back in November of 2016 - thanks to all the readers who have supported us along the way! If you are familiar with the R Views posts, then you probably have a pretty good sense for the book’s style, prose, and code approach, but I’d like to add a few quick words of background.

Read more

Share Comments · · ·

CRAN’s New Missing Data Task View

It is a relatively rare event, and cause for celebration, when CRAN gets a new Task View. This week the r-miss-tastic team: Julie Josse, Nicholas Tierney and Nathalie Vialaneix launched the Missing Data Task View. Even though I did some research on R packages for a post on missing values a couple of years ago, I was dumbfounded by the number of packages included in the new Task View. This single page not only describes what R has to offer with respect to coping with missing data, it is probably the world’s most complete index of statistical knowledge on the subject.

Read more

Share Comments · · · ·

Searching for R Packages

Searching for R packages is a vexing problem for both new and experienced R users. With over 13,000 packages already on CRAN, and new packages arriving at a rate of almost 200 per month, it is impossible to keep up. Package names can be almost anything, and they are rarely informative, so searching by name is of little help. I make it a point to look at all of the new packages arriving on CRAN each month, but after a month or so, when asked about packages related to some particular topic, more often than not, I have little more to offer than a vague memory that I saw something that might be useful.

Read more

Share Comments · · ·

Serendipity at R / Medicine

We knew we were on to something important early on in the process of organizing R / Medicine 2018. Even during our initial attempts to articulate the differences between this conference and R / Pharma 2018, it became clear that the focus on the use of R and statistics in clinical settings was going to be a richer topic than just the design of clinical trials. However, it wasn’t until the conference got underway that we realized there was magic in the mix of attendees.

Read more

Share Comments · · ·

September 2018: Top 40 New Packages

September was another relatively slow month for new package activity on CRAN: “only” 126 new packages by my count. My Top 40 list is heavy on what I characterize as “utilities”: packages that either extend R in some fashion or make it easier to do things in R. This month, the packages I selected fall into eight categories: Data, Finance, Machine Learning, Science, Statistics, Time Series, Utilities and Visualization.

Read more

Share Comments · · ·

Some Thoughts on R / Pharma 2018

It’s no secret that there are few industries more competitive than the pharmaceutical industry. Big money placed on long-shot bets for block-buster drugs where being first makes all the difference means a constant struggle to gain a competitive edge. So, you might find it surprising that the inaugural R / Pharma Conference held this past August on the Harvard campus in a very classy auditorium was all about collaboration. Some might also find it surprising that data scientists from competitive companies would gather to share information, but this is quite common.

Read more

Share Comments · · ·