NY R Conference

2017-04-28

by Joseph Rickert

The 2017 New York R Conference was held last weekend in Manhattan. For the third consecutive year, the organizers - a partnership including Lander Analytics, The New York Meetup and Work-Bench - pulled off a spectacular event. There was a wide range of outstanding talks, some technical and others more philosophical, a palpable sense of community and inclusiveness, great food, beer and Bloody Marys.

Fortunately, the talks were recorded. While watching the videos (which I imagine will be available in a couple of weeks) will be no substitute for the experience, I expect that the extended R community will find the talks valuable. In this post, I would like to mention just a couple of the presentations that touched on the professions and practices of data science and statistics.

In her talk, “The Humble (Programmer) Data Scientist: Essence and Accident in (Software Engineering) Data Science”, Friederike Schüür invoked an analogy between the current effort to establish data science as a profession and the effort, some sixty years ago, to obtain professional status for programmers. Schüür called out three works from master programmers: The Humble Programmer by Edsger Dijkstra, Computer Programming as an Art by Donald Knuth, and No Silver Bullet - Essence and Accident in Software Engineering by Fred Brooks. These are all classic papers, perennially relevant, and well worth reading and re-reading.

In the first paper, Dijkstra writes:

Another two years later, in 1957, I married and Dutch marriage rites require you to state your profession and I stated that I was a programmer. But the municipal authorities of the town of Amsterdam did not accept it on the grounds that there was no such profession. . . So much for the slowness with which I saw the programming profession emerge in my own country. Since then I have seen more of the world, and it is my general impression that in other countries, apart from a possible shift of dates, the growth pattern has been very much the same.

Social transformations appear to happen much more quickly in the twenty-first century. Nevertheless, one thing seems to be an invariant: establishing a new profession requires many different kinds of people to alter their world views.

The second talk I would like to mention was by Andrew Gelman. The title listed in the program is “Theoretical Statistics is the Theory of Applied Statistics: How to Think About What We Do”. Since Gelman spoke without slides in front of a dark screen, I am not sure that is actually the talk he gave, but whatever talk he gave, it was spellbinding. Gelman spoke so quickly and threw out so many ideas that my mental buffers were completely overwritten. I will have to wait for the video to sort out my impressions. Nevertheless, there were four quotations that I managed to write down that I think are worth considering:

“Taking something that was outside of the realm of statistics and putting it into statistics is a good idea.”
“I would like workflows to be more formalized.”
“Much of workflow is still outside of the tent of statistical theory.”
“I think we need a theory of models.”

Taken together, and I hope not taken out of context, the sentiment expressed here argues for expanding the domain of theoretical statistics to include theories of inferential models and the workflows in which they are produced.

The idea of the importance of workflows to science itself seems to be gaining some traction in the statistical community. In his yet-to-be-published but widely circulated paper, 50 years of Data Science, theoretical statistician David Donoho writes:

A crucial hidden component of variability in science is the analysis workflow. Different studies of the same intervention may follow different workflows, which may cause the studies to get different conclusions… Joshua Carp studied analysis workflows in 241 fMRI studies. He found nearly as many unique workflows as studies! In other words researchers are making up a new workflow for pretty much every study.

My take on things is that the movement to establish the profession of data science already has too much momentum behind it to be stopped any time soon. Whether it is the data scientists or the statisticians who come to own the theories of models and workflows doesn’t matter all that much. No matter who develops these research areas, we will all benefit. The social and intellectual friction caused by the movement of data science is heating things up in a good way.