An R Users Guide to JSM 2019

If you are like me, and rather last minute about making a plan to get the most out of a large conference, you are just starting to think about JSM 2019, which begins in just a few days. My plans always begin with an attempt to sleuth out the R-related sessions. While in the past it took quite a bit of work to identify talks that were likely backed by R-based calculations, this is clearly no longer the case.

Three Strategies for Working with Big Data in R

For many R users, it’s obvious why you’d want to use R with big data, but not so obvious how. In fact, many people (wrongly) believe that R just doesn’t work very well for big data. In this article, I’ll share three strategies for thinking about how to use big data in R, as well as some examples of how to execute each of them. By default, R runs only on data that can fit into your computer’s memory.

Dividend Sleuthing with R

Welcome to a mid-summer edition of Reproducible Finance with R. Today, we’ll explore the dividend histories of some stocks in the S&P 500. By way of history for all you young tech IPO and crypto investors out there: way back, a long time ago in the dark ages, companies used to take pains to generate free cash flow and then return some of that free cash to investors in the form of dividends.

Imagine your Data Before You Collect It

This post introduces the fabricatr package, whose role in the DeclareDesign suite of packages is to simulate data structure and variables. fabricatr helps you to think about your data before you start analysis or even collect it.

May 2019: "Top 40" New CRAN Packages

Two hundred twenty-two new packages made it to CRAN in May, and it was more of an effort than usual to select the “Top 40”. Nevertheless, here they are in nine categories: Computational Methods, Data, Machine Learning, Mathematics, Medicine, Science, Statistics, Utilities, and Visualization.

A Gentle Introduction to tidymodels

Recently, I had the opportunity to showcase tidymodels in workshops and talks. Because of my vantage point as a user, I figured it would be valuable to share what I have learned so far. Let’s begin by framing where tidymodels fits in our analysis projects. The diagram above is based on the R for Data Science book, by Wickham and Grolemund. The version in this article illustrates what step each package covers.

Equal Size kmeans

We were recently presented with a problem where the decision maker wanted to understand how their data would naturally group together. The classic technique of k-means clustering was a natural choice; it’s well known, computationally efficient, and implemented in base R via the kmeans() function. Our problem had a slight wrinkle: the decision maker wished to see the data grouped with (nearly) equal sizes. Now, a ‘true’ statistician would tell the client that the right thing to do from a theoretical perspective was to use native k-means results because some centers can simply have more nearby points than other centers.

reticulate, virtualenv, and Python in Linux

Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn. reticulate is an R package that allows us to use Python modules from within RStudio. I recently found this functionality useful while trying to compare the results of different uplift models. Though I did have R’s uplift package producing Qini charts and metrics, I also wanted to see how things looked with Wayfair’s promising pylift package.

Introducing DeclareDesign, a Platform for Research Design

Research design consists of a set of choices about what research procedures to use: for example, how many subjects to interview, which questions to ask them, and what to do in the analysis phase with the data that results from these choices. We do not have good tools for assessing whether the chosen procedures are good ones. DeclareDesign is an R package for learning about, implementing, and communicating research procedures, from data collection to data analysis.

April 2019: "Top 40" New CRAN Packages

One hundred eighty-seven new packages made it to CRAN in April. Here are my picks for the “Top 40”, organized into ten categories: Biotechnology, Data, Econometrics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.

Biotechnology. genpwr v1.00: Provides functions for power and sample size calculations for genetic association studies allowing for mis-specification of the model of genetic susceptibility. The methods employed are extensions of Gauderman (2002) and Gauderman (2002).
