Baltimore has the reputation of being a tough town: hot in the summer and gritty, but the convention center hosting the Joint Statistical Meetings is a pretty cool place to be. There are thousands of people here and so many sessions (over 600) that it’s just impossible to get an overview of all that’s going on. So, here are a couple of snapshots from an R-focused statistical tourist.
First Snapshot: What’s in a Vector?
Yesterday, Gabe Becker showed some eye-opening work he is doing with Luke Tierney and Thomas Kalibera on features that are making their way into R core. Gabe started his presentation by running the following code:
x = 1 : 1e15
x < 4
sort(x, decreasing = TRUE)
Don’t try this at home. It will almost certainly be unpleasant if you do, but it all ran instantaneously on Gabe’s laptop. This is a simple example of the new “ALTREP Framework” that promises considerable performance improvements to basic statistical calculations by exploiting two insights:
- Sometimes things are easy if you have some previous knowledge about your data
- Maybe it’s not always necessary to work with contiguous blocks of memory
For example, Gabe’s code ran so quickly because his new experimental version of R “knew” that x was sorted and that there were no NAs. As Gabe put it: “Knowing is a lot more than half the battle”. Another thing that R will eventually “know” is not to convert numbers to character strings until absolutely necessary. Preliminary work suggests that this feature will provide a significant speedup for ordinary things like working with the model design matrix.
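To make the “knowing” idea concrete, here is a toy sketch in plain R. It is not the ALTREP API and not Gabe’s implementation: the names compact_seq, seq_length, seq_sort, seq_first and seq_count_less are all made up for illustration. The sketch assumes an integer-valued sequence from:to and stores only its endpoints plus a direction flag, so that questions about the data are answered from that knowledge rather than by touching every element.

```r
# Toy "compact sequence": keep only the endpoints of from:to and the
# direction, instead of allocating every element. All names here are
# hypothetical and unrelated to R's internal ALTREP interface.
compact_seq <- function(from, to) {
  stopifnot(from <= to)
  list(from = from, to = to, decreasing = FALSE)
}

# The length follows from the endpoints; nothing is counted.
seq_length <- function(s) s$to - s$from + 1

# The sequence is known to be sorted, so "sorting" only sets a flag.
seq_sort <- function(s, decreasing = FALSE) {
  s$decreasing <- decreasing
  s
}

# The first element depends only on the direction.
seq_first <- function(s) if (s$decreasing) s$to else s$from

# How many elements are < k? Answered arithmetically, with no
# element-by-element comparison (assumes integer-valued endpoints).
seq_count_less <- function(s, k) {
  max(0, min(s$to, ceiling(k) - 1) - s$from + 1)
}

x <- compact_seq(1, 1e15)
seq_length(x)                               # number of elements
seq_first(seq_sort(x, decreasing = TRUE))   # largest element comes first
seq_count_less(x, 4)                        # elements 1, 2, 3
```

Every call works from three stored numbers, which is why none of them depends on how long the sequence is.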
A really exciting aspect of the new work is that R will export the ALTREP Framework, making it possible for package developers to take advantage of knowledge about their particular data structures. If things go well, the new framework should be part of a 2018 release.
Second Snapshot: Statistical Learning with Big Data
In his talk on Monday, Professor Rob Tibshirani covered considerable ground, including definitions of Machine Learning and Statistical Learning, “A Consumer Report” on five popular machine learning algorithms, and three case studies, one of which involved the use of a relatively new algorithm: Generalized Random Forests. The definitions:
- Machine learning constructs algorithms that can learn from data
- Statistical learning is a branch of Statistics that was born in response to Machine Learning, emphasizing statistical models and assessment of uncertainty
These definitions, I think, capture the difference between the computer science and statistics mindsets.
The Consumer Report, captured in the following table, summarizes Professor Tibshirani’s current thinking on the strengths and weaknesses of the machine learning algorithms.
Although the table does offer definite advice on model selection, Professor Tibshirani also advised that practitioners experiment with several methods when dealing with difficult problems.
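As a rough illustration of what experimenting with several methods can look like in practice, here is a minimal sketch in base R. It is my own example, not something from the talk: the built-in mtcars data, the two candidate models and the helper cv_mse are all assumptions chosen only to keep the code self-contained. It compares the candidates by 5-fold cross-validated mean squared error.

```r
# Compare two candidate models by 5-fold cross-validated mean squared
# error. Data set, formulas and the helper name cv_mse are illustrative.
set.seed(1)
k <- 5
# Assign each row of mtcars to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_mse <- function(formula) {
  errs <- sapply(1:k, function(i) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit <- lm(formula, data = train)
    # Squared prediction error on the held-out fold.
    mean((test$mpg - predict(fit, newdata = test))^2)
  })
  mean(errs)
}

results <- c(
  linear    = cv_mse(mpg ~ wt),
  quadratic = cv_mse(mpg ~ poly(wt, 2))
)
results
```

The same loop works for any method with a fit and predict step, so swapping in other learners only means changing the body of cv_mse.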
That’s it for now: wish you were here. I will write again soon.