The Joint Statistical Meetings offer an astounding number of talks. It is impossible for an individual to see more than a small portion of what is going on. Even so, a diligent attendee ought to come away with more than a few good ideas. The following are two big ideas that I got from the conference.
Session 149, an invited panel on Theory versus Practice, featured an All-Star team of panelists (Edward George, Trevor Hastie, Elizaveta Levina, John Petkau, Nancy Reid, Richard J Samworth, Robert Tibshirani, Larry Wasserman and Bin Yu). It covered a lot of ground and wove a rich tapestry of ideas. A persistent theme in many of the discussions was the worry that the paper publication process is undermining the quality of statistical results. Pressed to “sell” their ideas to journal editors, and constrained by publication space, authors are being conditioned to emphasize the evidence for their results and neglect limitations or cases where their methods don’t perform well.
The big idea that really struck me was the notion, articulated by Rob Tibshirani, that simulation is a practical way to express healthy scientific skepticism, and one that can be incorporated into both theoretical and applied papers without significantly increasing a paper’s length or complexity. (For the purposes of reproducibility, almost all of the simulation work can be submitted as supplementary material or stashed on a GitHub site.) For theoretical papers, authors could use simulation to examine underlying assumptions and determine which are most important, while authors of applied papers could point out cases where their methods or algorithms don’t work particularly well. Tibshirani noted that every model has its Achilles heel, and went so far as to suggest that every paper ought to have at least one table that exposes the weaknesses of a model or algorithm.
For researchers working in R, including simulations should add little burden, as Monte Carlo simulation capabilities are built into the core of the language. (If you are new to R, you might find this brief tutorial by Corey Chivers helpful in getting started with simulating from statistical models.)
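To make the idea of a weakness-exposing table concrete, here is a minimal sketch of such a simulation. It is written in Python with only the standard library, purely for illustration; the same steps map onto a few lines of R. Everything about the setup is assumed for the example and was not presented in the session: the method under test (a nominal 95% t-interval for the mean), the sample size of 10, the 5,000 replications, the two data-generating distributions, and the helper names.

```python
import math
import random
import statistics

N = 10               # sample size per simulated data set (assumed for the example)
REPS = 5000          # number of Monte Carlo replications (assumed for the example)
T_CRIT_DF9 = 2.262   # 97.5th percentile of Student's t with N - 1 = 9 degrees of freedom


def t_interval_covers(sample, true_mean):
    """Return True if the nominal 95% t-interval from `sample` contains `true_mean`."""
    mean = statistics.fmean(sample)
    half_width = T_CRIT_DF9 * statistics.stdev(sample) / math.sqrt(len(sample))
    return abs(mean - true_mean) <= half_width


def coverage(draw, true_mean, seed=1):
    """Estimate the interval's coverage when observations come from `draw(rng)`."""
    rng = random.Random(seed)  # fixed seed so the table can be reproduced
    hits = sum(
        t_interval_covers([draw(rng) for _ in range(N)], true_mean)
        for _ in range(REPS)
    )
    return hits / REPS


# Case where the method's assumptions hold: standard normal data, true mean 0.
normal_cov = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)

# Case chosen to stress the method: skewed lognormal data, true mean exp(1/2).
lognormal_cov = coverage(
    lambda rng: rng.lognormvariate(0.0, 1.0), true_mean=math.exp(0.5)
)

print(f"{'data':<12}{'coverage':>10}")
print(f"{'normal':<12}{normal_cov:>10.3f}")
print(f"{'lognormal':<12}{lognormal_cov:>10.3f}")
```

The normal case should land close to the nominal 95%, while the skewed lognormal case typically falls noticeably short at this sample size. A two-row table like this one is the sort of honest reporting the panel was asking for, and the script itself can go into supplementary material.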
The second big idea came in Session 271, the Invited Special Presentation: Introductory Overview Lecture: Reproducibility, Efficient Workflows, and Rich Environments. In her talk, How Computational Environments Can (Unexpectedly) Influence Statistical Findings, Victoria Stodden elaborated on the idea that “As statistical research typically takes place in a constructed environment in silico, the findings may not be independent of this environment”. To help establish the pedigree of her ideas, Stodden briefly quoted David Donoho’s famous remark that a scientific paper is only the advertising for scientific work and not the scholarship itself. The paragraph surrounding this remark is illuminating. In his 2010 paper, An invitation to reproducible computational research, Donoho writes:
I was inspired more than 15 years ago by John Claerbout, an earth scientist at Stanford, to begin practicing reproducible computational science. See Claerbout and Karrenbach (1992). He pointed out to me, in a way paraphrased in Buckheit and Donoho (1995): “an article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.” This struck me as getting to the heart of a central problem in modern scientific publication. Most of the work in a modern research project is hidden in computational scripts that go to produce the reported results. If these scripts are not brought out into the open, no one really knows what was done in a certain project…
Here we have an assertion of the essential scientific value of software and code in a thread that traces the need for reproducible research back quite a few years to a collaboration between scientists and statisticians.
Immediately following Stodden, in his talk on Living a Reproducible Life, Hadley Wickham gave a virtuoso presentation on using modern, R-centric reproducible tools. (He even managed to rebase a GitHub repo without calling attention to it.)
My main “takeaways” from these two talks were, first of all, an affirmation that the CRAN and Bioconductor repositories are themselves extremely valuable contributions to statistics. They do more than enable the daily practice of statistics for many statisticians: by providing reference implementations (and documentation) for a vast number of models and algorithms, they also serve as repositories of statistical knowledge.
The second takeaway is that reproducible research, long acknowledged to be essential to the scientific process, is now feasible for a large number of practitioners. Using coding tools such as R Markdown along with infrastructure such as GitHub, it is possible to develop reproducible workflows for significant portions of a research process. R-centric reproducible tools are helping to put the science in data science.
Note that both Victoria Stodden and Rob Tibshirani, along with R Core member Michael Lawrence, will be delivering keynote presentations at the inaugural R / Medicine conference coming up September 7th and 8th in New Haven, CT.