<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Book Review on R Views</title>
    <link>https://rviews.rstudio.com/categories/book-review/</link>
    <description>Recent content in Book Review on R Views</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 17 Jun 2022 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://rviews.rstudio.com/categories/book-review/" rel="self" type="application/rss+xml" />
    
    
    
    
    <item>
      <title>Frank&#39;s R Workflow</title>
      <link>https://rviews.rstudio.com/2022/06/17/frank-s-workflow/</link>
      <pubDate>Fri, 17 Jun 2022 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2022/06/17/frank-s-workflow/</guid>
      <description>
        &lt;p&gt;&lt;a href=&#34;https://www.fharrell.com/&#34;&gt;Frank Harrell&amp;rsquo;s&lt;/a&gt; new eBook, &lt;a href=&#34;http://hbiostat.org/rflow/&#34;&gt;&lt;em&gt;R Workflow&lt;/em&gt;&lt;/a&gt;, which aims &amp;ldquo;to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting&amp;rdquo;, is an ambitious document that is notable on multiple levels.&lt;/p&gt;

&lt;p&gt;To begin with, the workflow itself is much more than a simple progression of logical steps.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;workflow.png&#34; height = &#34;500&#34; width=&#34;100%&#34; alt=&#34;Diagram of Reproducible Research Workflow&#34;&gt;&lt;/p&gt;

&lt;p&gt;This workflow is clearly the result of a process forged through trial and error by a master statistician over many years. As the diagram indicates, the document takes a holistic view of a statistical analysis, covering document preparation, data manipulation, statistical practice, computational concerns, and more.&lt;/p&gt;

&lt;p&gt;Then, there is the synthesis of a wide range of content into a succinct, very readable exposition that dips into some very deep topics. Frank&amp;rsquo;s examples are streamlined presentations of analyses and code that are both sophisticated and practical. The missing value section suggests a whole array of analyses through a careful presentation of plots, and the section on data checking introduces a level of automation beyond what is commonly done.&lt;/p&gt;

&lt;p&gt;Frank&amp;rsquo;s writing style is clear, informal and from the perspective of a teacher who wants to show you some cool things along with the basics. For example, don&amp;rsquo;t miss the &lt;em&gt;if Trick&lt;/em&gt; in section 2.4.3.&lt;/p&gt;

&lt;p&gt;I should mention that Frank&amp;rsquo;s eBook is not a &lt;em&gt;tidyverse&lt;/em&gt; presentation. The code examples are built around base R, Frank&amp;rsquo;s &lt;code&gt;Hmisc&lt;/code&gt; and &lt;code&gt;rms&lt;/code&gt; packages, and an eclectic mix of packages that includes &lt;code&gt;data.table&lt;/code&gt;, &lt;code&gt;plotly&lt;/code&gt;, and the &lt;em&gt;tidyverse&lt;/em&gt; packages &lt;code&gt;haven&lt;/code&gt; and &lt;code&gt;ggplot2&lt;/code&gt;. In a way, this selection of packages reflects the evolution of R itself. For example, as with many popular R packages, &lt;code&gt;Hmisc&lt;/code&gt; most likely started out as Frank&amp;rsquo;s personal toolkit. However, after many years of Frank&amp;rsquo;s deep commitment to using R and contributing R tools, which includes seventy versions of &lt;code&gt;Hmisc&lt;/code&gt; in nineteen years, the package has become a fundamental resource. (Have a look at the reverse depends, imports, and suggests.) Also, the mix of packages with different design philosophies underlying &lt;em&gt;R Workflow&lt;/em&gt; reflects the flexibility of the R language and the organic growth of the R ecosystem.&lt;/p&gt;

&lt;p&gt;Perhaps the most striking aspect of the eBook is the way Frank uses &lt;a href=&#34;https://quarto.org/&#34;&gt;&lt;code&gt;Quarto&lt;/code&gt;&lt;/a&gt;, &lt;code&gt;knitr&lt;/code&gt; and &lt;code&gt;Hmisc&lt;/code&gt; to build an elegant reproducible document about building reproducible documents. For example, &lt;code&gt;Quarto&lt;/code&gt; permits the effective placement of plots in the right margins of the document, and the &lt;code&gt;Quarto&lt;/code&gt; &lt;em&gt;callouts&lt;/em&gt; in Section 3.4 enable the mini tutorials that include &lt;em&gt;Special Considerations for Latex/pdf&lt;/em&gt; and &lt;em&gt;Using Tooltips with Mermaid&lt;/em&gt; to be embedded in the document without interrupting its flow. Moreover, along with functions like &lt;code&gt;Hmisc::getHdata()&lt;/code&gt; and &lt;code&gt;Hmisc::getRs()&lt;/code&gt;, &lt;code&gt;Quarto&lt;/code&gt; enables the document to achieve a high level of reproducibility by pulling data and code directly from GitHub repositories.&lt;/p&gt;

&lt;p&gt;Not only can Frank&amp;rsquo;s &lt;em&gt;R Workflow&lt;/em&gt; teach you some serious statistics, but studying its construction will take you a long way towards building aesthetically pleasing reproducible documents.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Frank Harrell will be delivering a keynote address on August 26th at the upcoming &lt;a href=&#34;https://events.linuxfoundation.org/r-medicine/&#34;&gt;R/Medicine&lt;/a&gt; conference.&lt;/em&gt;&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2022/06/17/frank-s-workflow/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>A Few Old Books</title>
      <link>https://rviews.rstudio.com/2019/04/25/a-few-old-books/</link>
      <pubDate>Thu, 25 Apr 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/04/25/a-few-old-books/</guid>
      <description>
        &lt;p&gt;&lt;em&gt;Greg Wilson is a data scientist and professional educator at RStudio.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My &lt;a href=&#34;https://rviews.rstudio.com/2019/02/20/a-few-new-books/&#34;&gt;previous column&lt;/a&gt; looked at a few new books about R. In this one, I&amp;rsquo;d like to explore a few books about programming that people coming from data science backgrounds may not have stumbled upon.&lt;/p&gt;

&lt;p&gt;The first is Michael Nygard&amp;rsquo;s &lt;a href=&#34;https://pragprog.com/book/mnee2/release-it-second-edition&#34;&gt;&lt;em&gt;Release It!&lt;/em&gt;&lt;/a&gt;, which more than lives up to its subtitle, &amp;ldquo;Design and Deploy Production-Ready Software&amp;rdquo;.  Most of us can write programs that work for us on our machines; this book explores what it takes to create software that will work reliably for other people, on machines you&amp;rsquo;ve never met, long after you&amp;rsquo;ve moved on to your next project.  It focuses on software that&amp;rsquo;s deployed for general use rather than installed on individuals&amp;rsquo; machines, and covers stability patterns and anti-patterns, designing software to meet production needs, security, and a range of other pragmatic issues.  You might not need to take care of these things yourself, but whoever has to get your software running on the departmental cluster will be grateful that you thought about it, and can have a sensible conversation about trade-offs.&lt;/p&gt;

&lt;p&gt;The second book is Andreas Zeller&amp;rsquo;s &lt;a href=&#34;http://www.whyprogramsfail.com/&#34;&gt;&lt;em&gt;Why Programs Fail&lt;/em&gt;&lt;/a&gt;, which bills itself as &amp;ldquo;a guide to pragmatic debugging&amp;rdquo;, and has been turned into &lt;a href=&#34;https://www.udacity.com/course/software-debugging--cs259&#34;&gt;a Udacity course&lt;/a&gt;.  Programmers spend anywhere from a quarter to three quarters of their time debugging, but most only get an in-passing overview of how to do this well, and are never shown tools more advanced than print statements and break-point debuggers.  Zeller starts with that, but goes much further to look at automatic and semi-automatic ways of simplifying programs to localize problems, isolating values&amp;rsquo; origins, program slicing, anomaly detection, and much more.  Some of the methods he describes will seem very familiar to data scientists, though the domain is new; others will take readers without a computer-science background into new territory in the same way that &lt;a href=&#34;https://adv-r.hadley.nz/&#34;&gt;&lt;em&gt;Advanced R&lt;/em&gt;&lt;/a&gt; does.&lt;/p&gt;

&lt;p&gt;Our third entry is Michael Feathers&amp;rsquo; &lt;a href=&#34;https://www.oreilly.com/library/view/working-effectively-with/0131177052/&#34;&gt;&lt;em&gt;Working Effectively with Legacy Code&lt;/em&gt;&lt;/a&gt;. Feathers defines legacy code as software that we&amp;rsquo;re reluctant to modify because we don&amp;rsquo;t understand how it works and are afraid of breaking it.  Having a comprehensive test suite allays this fear, but how can we construct one after the fact for a tangled mess of code?  The bulk of the book explores answers to this question, including how to identify seams where code can be split, how to break dependencies so that parts can be improved incrementally, and so on.  Some of the examples may seem a little out of date (the book is almost 15 years old), but they all apply directly to the unholy mixture of Perl, shell scripts, hundred-line SQL statements, and ten-page R scripts that you were just handed.&lt;/p&gt;

&lt;p&gt;Number four is Jeff Johnson&amp;rsquo;s &lt;a href=&#34;https://www.elsevier.com/books/gui-bloopers-20/johnson/978-0-12-370643-0&#34;&gt;&lt;em&gt;GUI Bloopers&lt;/em&gt;&lt;/a&gt;.  I was in two startups in the 1990s, and in both of them, I was told after a few weeks that I was never allowed to work on the user interface again.  It was the right decision, but this book might have made it unnecessary.  Rather than trying to explain the rules for designing a good user interface, Johnson gives example after example of how to fix bad ones.  The companion book, &lt;a href=&#34;https://textbooks.elsevier.com/manualsprotectedtextbooks/9781558608405/Static/index.html&#34;&gt;&lt;em&gt;Web Bloopers&lt;/em&gt;&lt;/a&gt;, is less useful today because web interfaces have evolved so rapidly, but either will help you make an interface that is at least not bad.&lt;/p&gt;

&lt;p&gt;The last entry for this post is Ashley Davis&amp;rsquo;s &lt;a href=&#34;https://www.manning.com/books/data-wrangling-with-javascript&#34;&gt;&lt;em&gt;Data Wrangling with JavaScript&lt;/em&gt;&lt;/a&gt;. As its title suggests, it doesn&amp;rsquo;t spend very much time on statistical theory; instead, it covers the &amp;ldquo;other 90%&amp;rdquo; of squeezing answers out of data, from establishing your data pipeline and getting started with Node (a widely-used command-line version of JavaScript) to cleaning, analyzing, and visualizing data. There are lots of code samples and plenty of diagrams, and you can download both the data sets the author uses in examples and his &lt;a href=&#34;http://www.data-forge-js.com/&#34;&gt;Data-Forge library&lt;/a&gt;. I suspect readers will need some prior familiarity with JavaScript to dive into this, but Davis shows just how far you can go with what&amp;rsquo;s available today, and that the journey is a lot smoother than people might think.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/04/25/a-few-old-books/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Paid in Books: An Interview with Christian Westergaard</title>
      <link>https://rviews.rstudio.com/2019/03/07/treasured-books-an-interview-with-christian-westergaard/</link>
      <pubDate>Thu, 07 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/03/07/treasured-books-an-interview-with-christian-westergaard/</guid>
      <description>
        &lt;p&gt;R is greatly benefiting from new users coming from disciplines that traditionally did not involve much serious computation. Journalists&lt;sup&gt;1&lt;/sup&gt; and humanist scholars&lt;sup&gt;2&lt;/sup&gt;, for example, are embracing R. But does the avenue from the Humanities go both ways? In a recent conversation with Christian Westergaard, proprietor of &lt;a href=&#34;https://www.sophiararebooks.com/&#34;&gt;Sophia Rare Books&lt;/a&gt; in Copenhagen, I was delighted to learn that it does.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;JBR: I was very pleased to learn when I spoke with you recently at the California Antiquarian Book Fair that you were an S and S+ user in graduate school. What were you studying, and how were S and S+ helpful?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CW: I did a Master&amp;rsquo;s in mathematics and a Bachelor&amp;rsquo;s in statistics at the University of Copenhagen in Denmark, graduating in 2005. During the first year of my courses in statistics, we were quickly introduced to S+ in order to do monthly assignments. In these assignments, we were to apply the theory we had learned in the lectures to some concrete data. I still remember how difficult I initially found applying the right statistical tools to real-world problems, rather than just understanding the math in theoretical statistics. I developed a deep respect for applied statistics. Our minds can easily be deceived, and we need proper statistics to make the right decisions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: How did you move from technical studies to dealing in rare scientific books and manuscripts? On the surface it seems that these might be two completely unrelated activities. How did you find a path between these two worlds?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CW: It was a gradual shift. When I first began studying, I went into an old antiquarian book shop to acquire some second-hand math and statistics books to supplement my ordinary course texts. One of them was &lt;em&gt;Statistical Methods&lt;/em&gt; by &lt;a href=&#34;https://en.wikipedia.org/wiki/Anders_Hald&#34;&gt;Anders Hald&lt;/a&gt;. Hald was no longer working at the university, but his text had become a classic. I was fascinated by this book shop. The owner was an old, grey-haired man sitting behind a huge stack of books, smoking a pipe, and writing his sales catalogues on an old IBM typewriter. He allowed me to go down into his cellar, where there were books everywhere from floor to ceiling. There were many books down there which I wanted to acquire, but I hardly had any money. The cellar was a mess, and I offered to tidy up if he could pay me in the books I wanted, and he agreed. I loved coming to work there, and I continued to do so throughout my studies. My boss gave me more and more responsibility and put me in charge of the mathematics, physics, statistics and science books in general. When I finished my masters, I was considering doing a PhD. I loved mathematics and still do to this day. But I also found that when I woke up in the morning, I was thinking of antiquarian books, and in the evening I couldn’t get to bed because I was thinking of books. It gave me energy and happiness. So I thought, why not try being a rare book dealer for a year or two and see how it works out? It’s been 14 years since I made that decision, and I have really enjoyed it. In 2009, I decided to start my own company and specialize in important books and manuscripts in science.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: What is it like to be immersed in these rare artifacts that were so important for the transmission of scientific knowledge? What kinds of scholars do you consult to establish the authenticity of works like Euler’s &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4420/leonhard-euler/opuscula-varii-argumenti-tomus-i-conjectura-physica-circa-propagationem-soni-ac-luminis-tomus-ii&#34;&gt;Opuscula Varii Argumenti&lt;/a&gt; or Cauchy’s &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/3696/augustin-louis-cauchy/lecons-sur-le-calcul-diff-rentiel&#34;&gt;Leçons sur le calcul différentiel&lt;/a&gt;?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2019-03-05-Sophia_files/Cauchy.png&#34; height = &#34;200&#34; width=&#34;400&#34;&gt;&lt;/p&gt;

&lt;p&gt;CW: I feel privileged to handle some of these objects on a daily basis. One day I am sitting with an original autograph manuscript by Einstein doing research on relativity, and the next day I have  a presentation copy of Darwin’s Origin of Species in my hands. These are objects which have changed the world and the way we think about ourselves. In addition to the books and manuscripts, I find the people who I meet extremely interesting. A few years before Anders Hald (whose book had originally brought me into my old boss’ shop) passed away, I went to buy his books. He was 92 and completely fresh in his mind. We spoke about the history of statistics – a subject about which he authored several books.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: I have noticed that collectors seem to be very interested in the works of twentieth-century mathematicians and physicists. You have works by &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4543/alonzo-church/an-unsolvable-problem-in-elementary-number-theory&#34;&gt;Alonzo Church&lt;/a&gt;, &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4559/kurt-godel/uber-formal-unentscheidbare-satze-der-principia-mathematica-undver-wandter-systeme-i-offprint&#34;&gt;Kurt Gödel&lt;/a&gt;, &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4459/richard-phillips-feynman/surely-you-re-joking-mr-feynman-adventures-of-a-curious-character-as-told-to-ralph-leighton&#34;&gt;Richard Feynman&lt;/a&gt;, and others in your catalogue. But your roster of statisticians seems to focus on the old masters such as &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4159/pierre-simon-laplace-marquis-de/theorie-analytique-des-probabilites-paris-courcier-1812-with-supplement-a-la-theorie-analytique&#34;&gt;Laplace&lt;/a&gt; and &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4637/abraham-de-moivre/the-doctrine-of-chances-or-a-method-of-calculating-the-probability-of-events-in-play&#34;&gt;de Moivre&lt;/a&gt;. Are collectors also interested in Karl Pearson, Udny Yule, and R. A. Fisher?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;/post/2019-03-05-Sophia_files/deMoivre.png&#34; height = &#34;200&#34; width=&#34;400&#34;&gt;&lt;/p&gt;

&lt;p&gt;CW: Certainly. Perhaps so much so that every time I get one of Pearson’s or Fisher’s main papers, it sells immediately. That’s why you don’t see them in my stock at the moment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: I noticed that you had two works by &lt;a href=&#34;https://www.sophiararebooks.com/pages/books/4276/sonya-kowalevsky-sofya-vasilyevna-or-kovalevskaya/sur-une-propriete-du-systeme-dequations-differentielles-qui-definit-la-rotation-dun-corps-solide&#34;&gt;Sofya Vasilyevna Kovalevskaya&lt;/a&gt; on display in California. Do you see a renewed interest in the works of women scientists and mathematicians, or is this remarkable and brilliant woman an exception?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CW: There has definitely been a renewed interest in exceptional women scientists. A few years ago, the New York-based Grolier Club hosted an exhibition called ‘Extraordinary Women in Science and Medicine’, and several institutions are focusing on the subject. These women, who broke through the social constraints placed on them, are exceptional and fascinating people.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: Although there are notable exceptions (Donald Knuth’s typesetting comes to mind), I think most data scientists, computer scientists, and statisticians work in a digital world of ebooks and poorly printed texts. Do you think that the technical book as a collectable artifact will survive the twenty-first century? What advice would you give to working data scientists and statisticians who are interested in collecting?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;CW: Good question. Many important papers nowadays are not even printed, and the only physical material left from some landmark work might be a few scribbles the researcher made on a piece of paper. There are examples of people who collect digital art. They use various ways of signing or otherwise authenticating the artist’s work, even if it’s on a USB stick. Maybe that’s how some research papers might be collected in the future?&lt;/p&gt;

&lt;p&gt;My advice for anyone wanting to start collecting would be to first focus on some of the classics in their field, or in some other field that fascinates them. The classics will have been collected by many others in the past, and there will be good descriptions, bibliographies, and catalogues describing them and why they are collectible. That way, one will gradually get a feeling for which mechanisms are important when collecting and what to focus on, e.g., condition, provenance, etc. And then I&amp;rsquo;d say it&amp;rsquo;s important to build a good relationship with at least one dealer with a good reputation in the trade. Any great collection is built on a collaboration where collectors and dealers work together.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;JBR: Excellent advice! Thank you Christian.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;1&lt;/sup&gt; For example, have a look at some of the R training at this year&amp;rsquo;s &lt;a href=&#34;https://www.ire.org/conferences/nicar-2019/&#34;&gt;IRE-CAR Conference&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;2&lt;/sup&gt; See, for example, these Washington University in St. Louis &lt;a href=&#34;https://libguides.wustl.edu/c.php?g=385216&amp;amp;p=3561786&#34;&gt;resources&lt;/a&gt; for the digital humanities.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://www.sophiararebooks.com/&#34;&gt;Sophia Rare Books&lt;/a&gt; (Copenhagen) specializes in rare and important books and manuscripts in the history of science and medicine.&lt;/em&gt;&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/03/07/treasured-books-an-interview-with-christian-westergaard/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>A Few New R Books</title>
      <link>https://rviews.rstudio.com/2019/02/20/a-few-new-books/</link>
      <pubDate>Wed, 20 Feb 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2019/02/20/a-few-new-books/</guid>
      <description>
        &lt;p&gt;&lt;em&gt;Greg Wilson is a data scientist and professional educator at RStudio.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As a newcomer to R who prefers to read paper rather than pixels, I&amp;rsquo;ve been working my way through a more-or-less random selection of relevant books over the past few months. Some have discussed topics that I&amp;rsquo;m already familiar with in the context of R, while others have introduced me to entirely new subjects. This post describes four of them in brief; I hope to follow up with a second post in a few months as I work through the backlog on my desk.&lt;/p&gt;

&lt;p&gt;First up is Sharon Machlis&amp;rsquo; &lt;a href=&#34;https://www.amazon.ca/dp/1138726915/&#34;&gt;&lt;em&gt;Practical R for Mass Communication and Journalism&lt;/em&gt;&lt;/a&gt;, which is based on the author&amp;rsquo;s workshops for journalists. This book dives straight into doing the kinds of things a busy reporter or news analyst needs to do to meet a 5:00 pm deadline: data cleaning, presentation-quality graphics, and maps take precedence over control flow or the niceties of variable scope. I particularly enjoyed the way each chapter starts with a realistic project and works through what&amp;rsquo;s needed to build it. People who&amp;rsquo;ve never programmed before will be a little intimidated by how many packages they need to download if they try to work through the material on their own, but the instructions are clear, and the author&amp;rsquo;s enthusiasm for her material shines through in every example. (If anyone is working on a similar tutorial for sports data, please let me know - I have more than a few friends it would make very happy.)&lt;/p&gt;

&lt;p&gt;In contrast, Chris Beeley and Shitalkumar Sukhdeve&amp;rsquo;s &lt;a href=&#34;https://www.amazon.ca/Web-Application-Development-Using-Shiny/dp/1788993128/&#34;&gt;&lt;em&gt;Web Application Development with R Using Shiny&lt;/em&gt;&lt;/a&gt; focuses on a particular tool rather than an industry vertical. It covers exactly what its title promises, step by step from the basics through custom JavaScript functions and animations through persistent storage. Every example I ran was cleanly written and clearly explained, and it&amp;rsquo;s clear that the authors have tested their material with real audiences. I particularly appreciated the chapter on code patterns - while I&amp;rsquo;m still not sure I fully understand when and how to use &lt;code&gt;isolate()&lt;/code&gt; and &lt;code&gt;req()&lt;/code&gt;, I&amp;rsquo;m much less confused than I was.&lt;/p&gt;

&lt;p&gt;Functional programming has been the next big thing in computing since I was a graduate student in the 1980s. It does finally seem to be getting some traction outside the craft-beer-and-Emacs community, and &lt;a href=&#34;https://www.amazon.ca/dp/148422745X/&#34;&gt;&lt;em&gt;Functional Programming in R&lt;/em&gt;&lt;/a&gt; by Thomas Mailund looks at how these ideas can be used in R. Mailund writes clearly, and readers who don&amp;rsquo;t have a background in computer science may find this a gentle way into a complex subject. However, despite the subtitle &amp;ldquo;Advanced Statistical Programming for Data Science, Analysis and Finance&amp;rdquo;, there&amp;rsquo;s nothing particularly statistical or financial about the book&amp;rsquo;s content. Some parts felt rushed, such as the lightning coverage of point-free programming (which should have had either a detailed exposition or no mention at all), but my biggest complaint about the book is its price: I think $34 for 100 pages is more than most people will want to pay.&lt;/p&gt;

&lt;p&gt;Finally, we have Stefano Allesina and Madlen Wilmes&amp;rsquo; &lt;a href=&#34;https://www.amazon.ca/dp/0691182752/&#34;&gt;&lt;em&gt;Computing Skills for Biologists&lt;/em&gt;&lt;/a&gt;. As the subtitle says, this book presents a toolbox that includes Python, Git, LaTeX, and SQL as well as R, and is aimed at graduate students in biology who have just realized that a few hundred megabytes of messy data are standing between them and their thesis. The authors present the basics of each subject clearly and concisely, using real-world data analysis examples at every turn. They freely admit in the introduction that coverage will be broad and shallow, but that&amp;rsquo;s exactly what books like this should aim for, and they hit a bullseye. The book&amp;rsquo;s only weakness - unfortunately, a significant one - is an almost complete lack of diagrams. There are only six figures in its 400 pages, and none in the material on visualization. I realize that readers who are coding along with the examples will be able to view some plots and charts as they go, but I would urge the authors to include these in a second edition.&lt;/p&gt;

&lt;p&gt;R is growing by leaps and bounds, and so is the literature about it. If you have written or read a book on R recently that you think others would be interested in, please &lt;a href=&#34;mailto:greg.wilson@rstudio.com&#34;&gt;let us know&lt;/a&gt; - we&amp;rsquo;d enjoy checking it out.&lt;/p&gt;

&lt;p&gt;Stefano Allesina and Madlen Wilmes: &lt;em&gt;&lt;a href=&#34;https://www.amazon.ca/dp/0691182752/&#34;&gt;Computing Skills for Biologists: A Toolbox&lt;/a&gt;&lt;/em&gt;. Princeton University Press, 2019, 978-0691182759.&lt;/p&gt;

&lt;p&gt;Chris Beeley and Shitalkumar Sukhdeve: &lt;em&gt;&lt;a href=&#34;https://www.amazon.ca/Web-Application-Development-Using-Shiny/dp/1788993128/&#34;&gt;Web Application Development with R Using Shiny&lt;/a&gt;&lt;/em&gt; (3rd ed.). Packt, 2018, 978-1788993128.&lt;/p&gt;

&lt;p&gt;Sharon Machlis: &lt;em&gt;&lt;a href=&#34;https://www.amazon.ca/dp/1138726915/&#34;&gt;Practical R for Mass Communication and Journalism&lt;/a&gt;&lt;/em&gt;. Chapman &amp;amp; Hall/CRC, 2018, 978-1138726918.&lt;/p&gt;

&lt;p&gt;Thomas Mailund: &lt;em&gt;&lt;a href=&#34;https://www.amazon.ca/dp/148422745X/&#34;&gt;Functional Programming in R: Advanced Statistical Programming for Data Science, Analysis and Finance&lt;/a&gt;&lt;/em&gt;. Apress, 2017, 978-1484227459.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2019/02/20/a-few-new-books/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Review of Efficient R Programming</title>
      <link>https://rviews.rstudio.com/2017/05/19/efficient_r_programming/</link>
      <pubDate>Fri, 19 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/05/19/efficient_r_programming/</guid>
      <description>
        
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-05-15_Efficient_R_Programming_files/cover.PNG&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;In the crowded market of data science and R language books, Gillespie and Lovelace’s &lt;a href=&#34;http://shop.oreilly.com/product/0636920047995.do&#34;&gt;Efficient R Programming&lt;/a&gt; (2016) stands out. Over the course of ten comprehensive chapters, the authors address the primary tenets of developing efficient R programs. Unless you happen to be a member of the R core development team, you will find this book useful whether you are a novice R programmer or an established data scientist or engineer. This book is chock-full of useful tips and techniques that will help you improve the efficiency of your R programs, as well as the efficiency of your development processes. Although I have been using R daily (and nearly exclusively) for the past 4+ years, every chapter of this book provided me with new insights into how to improve my R code while helping solidify my understanding of previously learned techniques. Each chapter of &lt;strong&gt;Efficient R Programming&lt;/strong&gt; is devoted to a single topic, includes a “top five tips” list, covers numerous packages and techniques, and contains useful exercises and problem sets for consolidating key insights.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Chapter 1. Introduction&lt;/strong&gt;, the authors orient the audience to the key characteristics of R that affect its efficiency, compared to other programming languages. Importantly, the authors address R efficiency not just in the expected sense of algorithmic speed and complexity, but also broaden its scope to include programmer productivity and how it relates to programming idioms, IDEs, coding conventions, and community support – all things that can improve the efficiency of writing and maintaining code. This is doubly important for a language like R, which is notoriously flexible in its ability to solve problems in multiple ways. The first chapter concludes by introducing the reader to two valuable packages: (1) &lt;a href=&#34;https://cran.r-project.org/web/packages/microbenchmark/microbenchmark.pdf&#34;&gt;microbenchmark&lt;/a&gt;, an accurate benchmarking tool with nanosecond precision; and (2) &lt;a href=&#34;https://cran.r-project.org/web/packages/profvis/profvis.pdf&#34;&gt;profvis&lt;/a&gt;, a handy tool for profiling larger chunks of code. These two packages are used repeatedly throughout the remainder of the book to illustrate key concepts and highlight efficient techniques.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Chapter 2. Efficient Setup&lt;/strong&gt;, the reader is introduced to techniques for setting up a development environment that facilitates efficient workflow. Here the authors cover choices in operating system, R version, R start-up, alternative R interpreters, and how to maintain up-to-date packages with tools like &lt;a href=&#34;https://cran.r-project.org/web/packages/packrat/packrat.pdf&#34;&gt;packrat&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/installr/installr.pdf&#34;&gt;installr&lt;/a&gt;. I found their overview of the R startup process particularly useful, as the authors taught me how to modify my &lt;strong&gt;.Renviron&lt;/strong&gt; and &lt;strong&gt;.Rprofile&lt;/strong&gt; files to cache external API keys and customize my R environment, for example by adding alias shortcuts to commonly used functions. The chapter concludes by discussing how to set up and customize the &lt;a href=&#34;https://www.rstudio.com/&#34;&gt;RStudio&lt;/a&gt; environment (e.g., modifying code editing preferences, editing keyboard shortcuts, and turning off restoring &lt;strong&gt;.RData&lt;/strong&gt; at startup to help prevent bugs), which can greatly improve individual efficiency.&lt;/p&gt;
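&lt;p&gt;As a minimal sketch of this kind of start-up customization (the key name and aliases below are my own illustrations, not examples from the book), a &lt;strong&gt;.Renviron&lt;/strong&gt;/&lt;strong&gt;.Rprofile&lt;/strong&gt; pair might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# ~/.Renviron -- plain KEY=value lines, read once at startup:
#   MY_API_KEY=0123456789abcdef

# ~/.Rprofile -- R code sourced at startup
# Retrieve the cached key so it never appears in scripts
api_key &amp;lt;- Sys.getenv(&#39;MY_API_KEY&#39;)

# Alias shortcuts for commonly used functions
h &amp;lt;- utils::head
s &amp;lt;- base::summary
&lt;/code&gt;&lt;/pre&gt;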
&lt;p&gt;&lt;strong&gt;Chapter 3. Efficient Programming&lt;/strong&gt; introduces the reader to efficient programming by discussing “big picture” programming techniques and how they relate to the R language. This chapter will most likely be beneficial to established programmers who are new to R, as well as to data scientists and analysts who have limited exposure to programming in a production environment. In this chapter the authors introduce the “golden rule of R programming” before delving into the usual suspects of inefficient R code. Usefully, the book illustrates multiple ways of performing the same task (e.g., data selection) with different code snippets, and highlights the performance differences through benchmarked results. Here we learn about the pitfalls of growing vectors, the benefits of vectorization, and the proper use of factors. The chapter wraps up with the requisite overview of the apply function family, before discussing the use of variable caching (package &lt;a href=&#34;https://cran.r-project.org/web/packages/memoise/memoise.pdf&#34;&gt;memoise&lt;/a&gt;) and byte compilation as important techniques in writing fast, responsive R code.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-05-15_Efficient_R_Programming_files/byte.PNG&#34; /&gt;

&lt;/div&gt;
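&lt;p&gt;The flavor of those benchmarked comparisons is easy to reproduce. Here is a toy example of my own (not one from the book) using microbenchmark to contrast growing a vector inside a loop with a single vectorized call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(microbenchmark)

# growing with c() reallocates the vector on every iteration
grow &amp;lt;- function(n) {
  x &amp;lt;- numeric(0)
  for (i in seq_len(n)) x &amp;lt;- c(x, sqrt(i))
  x
}

# the vectorized version allocates once and loops in compiled code
vectorised &amp;lt;- function(n) sqrt(seq_len(n))

microbenchmark(grow(1e4), vectorised(1e4), times = 10)&lt;/code&gt;&lt;/pre&gt;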
&lt;p&gt;&lt;strong&gt;Chapter 4. Efficient Workflow&lt;/strong&gt; will be of primary use to junior-level programmers, analysts, and project managers who haven’t had enough time or practice to develop their own efficient workflows. This chapter discusses the importance of project planning, audience, and scope before delving into common tools that facilitate project management. In my opinion, one of the best aspects of R is the huge, maddeningly broad array of packages available on &lt;a href=&#34;https://cran.r-project.org/&#34;&gt;CRAN&lt;/a&gt; and &lt;a href=&#34;https://github.com/&#34;&gt;GitHub&lt;/a&gt;. The authors provide useful advice and techniques for identifying the packages that will be of most use to your project. A brief discussion of R Markdown and knitr concludes this chapter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Chapter 5. Efficient Input/Output&lt;/strong&gt; is devoted to efficient read/write operations. Anybody who has ever struggled with loading a big file into R for analysis will appreciate this discussion and the packages covered in this chapter. The &lt;a href=&#34;https://cran.r-project.org/web/packages/rio/rio.pdf&#34;&gt;rio&lt;/a&gt; package, which can handle a wide variety of common data file types, provides a useful starting point for exploratory work on a new project. Other packages that are discussed (including &lt;a href=&#34;https://cran.r-project.org/web/packages/readr/readr.pdf&#34;&gt;readr&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/data.table.pdf&#34;&gt;data.table&lt;/a&gt;) provide more efficient I/O than the corresponding functions in base R. The chapter ends with a discussion of two newer file formats and their associated packages (&lt;a href=&#34;https://cran.r-project.org/web/packages/feather/feather.pdf&#34;&gt;feather&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/RProtoBuf/RProtoBuf.pdf&#34;&gt;RProtoBuf&lt;/a&gt;), which can be used for fast, efficient, cross-language serialized data I/O.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-05-15_Efficient_R_Programming_files/feather.PNG&#34; /&gt;

&lt;/div&gt;
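&lt;p&gt;A minimal way to see those I/O differences for yourself (a throwaway benchmark of mine, not an example from the book) is to time base R against readr and data.table on the same file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)
library(data.table)

# write a modest CSV to a temporary file for timing
tmp &amp;lt;- tempfile(fileext = &amp;quot;.csv&amp;quot;)
write.csv(data.frame(x = runif(1e5), y = runif(1e5)), tmp, row.names = FALSE)

system.time(read.csv(tmp))    # base R
system.time(read_csv(tmp))    # readr
system.time(fread(tmp))       # data.table&lt;/code&gt;&lt;/pre&gt;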
&lt;p&gt;&lt;strong&gt;Chapter 6. Efficient Data Carpentry&lt;/strong&gt; introduces what are, in my opinion, the most useful R tools for &lt;em&gt;data munging&lt;/em&gt; – what Lovelace and Gillespie prefer to call by the more admirable term “data carpentry.” This chapter could more aptly be titled the “Tidyverse” or the “Hadleyverse”, for most of the tools discussed in this chapter were developed by the prolific R package author &lt;a href=&#34;https://github.com/hadley&#34;&gt;Hadley Wickham&lt;/a&gt;. Sections of the chapter are devoted to each of the primary packages of the &lt;a href=&#34;https://github.com/tidyverse&#34;&gt;tidyverse&lt;/a&gt;: &lt;a href=&#34;https://cran.r-project.org/web/packages/tibble/tibble.pdf&#34;&gt;tibble&lt;/a&gt;, a more useful and user-friendly data.frame; &lt;a href=&#34;https://cran.r-project.org/web/packages/tidyr/tidyr.pdf&#34;&gt;tidyr&lt;/a&gt;, used for reshaping data between wide and long forms; &lt;a href=&#34;https://cran.r-project.org/web/packages/stringr/stringr.pdf&#34;&gt;stringr&lt;/a&gt;, which provides a consistent API over obtuse regex functions; &lt;a href=&#34;https://cran.r-project.org/web/packages/dplyr/dplyr.pdf&#34;&gt;dplyr&lt;/a&gt;, used for efficient data processing including filtering, sorting, mutating, joining, and summarizing; and of course &lt;a href=&#34;https://cran.r-project.org/web/packages/magrittr/magrittr.pdf&#34;&gt;magrittr&lt;/a&gt;, for piping all these operations together with &lt;code&gt;%&amp;gt;%&lt;/code&gt;. A brief section on the &lt;a href=&#34;https://cran.r-project.org/web/packages/data.table/data.table.pdf&#34;&gt;data.table&lt;/a&gt; package rounds out the discussion of efficient data carpentry.&lt;/p&gt;
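&lt;p&gt;To give a flavor of the style these packages encourage, here is a small pipeline of my own on the built-in mtcars data (not an example from the book), chaining dplyr verbs with the magrittr pipe:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

mtcars %&amp;gt;%
  filter(cyl &amp;gt; 4) %&amp;gt;%                  # keep cars with more than 4 cylinders
  group_by(cyl) %&amp;gt;%                    # summarise within each cylinder count
  summarise(mean_mpg = mean(mpg)) %&amp;gt;%
  arrange(desc(mean_mpg))&lt;/code&gt;&lt;/pre&gt;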
&lt;p&gt;&lt;strong&gt;Chapter 7. Efficient Optimization&lt;/strong&gt; begins with the requisite optimization quote by computer scientist Donald Knuth:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this chapter, the authors introduce &lt;a href=&#34;https://cran.r-project.org/web/packages/profvis/profvis.pdf&#34;&gt;profvis&lt;/a&gt;, and they illustrate the utility of this package by showing how it can be used to identify bottlenecks in a Monte Carlo simulation of a Monopoly game. The authors next examine alternative methods in base R that can be used for greater efficiency. These include discussion of &lt;code&gt;if()&lt;/code&gt; vs. &lt;code&gt;ifelse()&lt;/code&gt;, sorting operations, AND (&lt;code&gt;&amp;amp;&lt;/code&gt;) and OR (&lt;code&gt;|&lt;/code&gt;) vs. &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; and &lt;code&gt;||&lt;/code&gt;, row/column operations, and sparse matrices. The authors then apply these tricks to the Monopoly code to show a 20-fold decrease in run time. The chapter concludes with a discussion and examples of parallelization, and the use of &lt;a href=&#34;https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf&#34;&gt;Rcpp&lt;/a&gt; as an R interface to underlying fast and efficient C++ code.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2017-05-15_Efficient_R_Programming_files/profvis.PNG&#34; /&gt;

&lt;/div&gt;
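&lt;p&gt;One of those base-R contrasts can be sketched in a few lines (a toy example of mine, not the book&amp;rsquo;s Monopoly code): the vectorized &lt;code&gt;&amp;amp;&lt;/code&gt; tests every element of a vector, while the scalar &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; short-circuits, so each is the efficient choice in a different setting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x &amp;lt;- runif(1e6)

# element-wise test over the whole vector
sum(x &amp;gt; 0.4 &amp;amp; x &amp;lt; 0.6)

# scalar test that short-circuits: the second condition
# is never evaluated when the first is FALSE
if (length(x) &amp;gt; 0 &amp;amp;&amp;amp; x[1] &amp;gt; 0.5) message(&amp;quot;first element is large&amp;quot;)&lt;/code&gt;&lt;/pre&gt;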
&lt;p&gt;I found the chapter &lt;strong&gt;Efficient Hardware&lt;/strong&gt; to be the least useful in the book (spoiler alert: add more RAM or migrate to cloud-based services), though the chapter on &lt;strong&gt;Efficient Collaboration&lt;/strong&gt; will be particularly useful for novice data scientists lacking real-world experience developing data artifacts and production applications in a distributed, collaborative environment. In this chapter, the authors discuss the importance of coding style, code comments, version control, and code review. The final chapter, &lt;strong&gt;Efficient Learning&lt;/strong&gt;, will find appreciative readers among those just getting started with R (and if this describes you, I would suggest that you start with this chapter first!). Here the authors discuss using and navigating R’s excellent internal help utility, as well as the importance of vignettes and source code in learning and understanding. After briefly introducing &lt;a href=&#34;https://cran.r-project.org/web/packages/swirl/swirl.pdf&#34;&gt;swirl&lt;/a&gt;, the book concludes with a discussion of online resources, including &lt;a href=&#34;http://stackoverflow.com/&#34;&gt;Stack Overflow&lt;/a&gt;; the authors thankfully provide the newbie with important information on how to ask the right questions and the importance of providing a &lt;a href=&#34;http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example&#34;&gt;great R reproducible example&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In summary, Lovelace and Gillespie’s &lt;strong&gt;Efficient R Programming&lt;/strong&gt; does an admirable job of illustrating the key techniques and packages for writing efficient R programs. The book will appeal to a wide audience, from advanced R programmers to those just starting out. In my opinion, the book hits that pragmatic sweet spot between breadth and depth, and it usefully contains links to external resources for those wishing to delve deeper into a specific topic. After reading this book, I immediately went to work refactoring a &lt;a href=&#34;https://shiny.rstudio.com/&#34;&gt;Shiny&lt;/a&gt; dashboard application I am developing and several internal R packages I maintain for our data science team. In a matter of a few short hours, I witnessed a 5- to 10-fold performance increase in these applications just by implementing a couple of new techniques. I was particularly impressed with the greatly improved end-user performance and the ease with which I implemented intelligent caching with the &lt;a href=&#34;https://cran.r-project.org/web/packages/memoise/memoise.pdf&#34;&gt;memoise&lt;/a&gt; package for a consumer decision tree application I am developing. If you care deeply about writing beautiful, clean, efficient code and taking your data science to the next level, I highly recommend adding &lt;strong&gt;Efficient R Programming&lt;/strong&gt; to your arsenal.&lt;/p&gt;
&lt;p&gt;The book is published by &lt;a href=&#34;https://www.oreilly.com/&#34;&gt;O’Reilly Media&lt;/a&gt; and is available &lt;a href=&#34;https://csgillespie.github.io/efficientR/&#34;&gt;online at the authors’ website&lt;/a&gt;, as well as through &lt;a href=&#34;https://www.safaribooksonline.com/home/&#34;&gt;Safari&lt;/a&gt;.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/05/19/efficient_r_programming/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Some Random Weekend Reading</title>
      <link>https://rviews.rstudio.com/2017/03/24/some-random-weekend-reading/</link>
      <pubDate>Fri, 24 Mar 2017 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2017/03/24/some-random-weekend-reading/</guid>
      <description>
        
&lt;script src=&#34;/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;/rmarkdown-libs/d3/d3.min.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;/rmarkdown-libs/forceNetwork-binding/forceNetwork.js&#34;&gt;&lt;/script&gt;

&lt;p&gt;Few of us have enough time to read, and most of us already have depressingly deep stacks of material that we would like to get through. However, sometimes a random encounter with something interesting is all that it takes to regenerate enthusiasm. Just in case you are not going to get to a book store with a good technical section this weekend, here are a few not-quite-random reads.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;http://www.deeplearningbook.org/&#34;&gt;Deep Learning&lt;/a&gt; by Goodfellow, Bengio and Courville is a solid, self-contained introduction to Deep Learning that begins with Linear Algebra and ends with discussions of research topics such as Autoencoders, Representation Learning, and Boltzmann Machines. The online layout extends an invitation to click anywhere and begin reading. Sampling the chapters, I found the text to be engaging reading, much more interesting and lucid than the typical online resource. For some Deep Learning practice with R and &lt;a href=&#34;https://h2o-release.s3.amazonaws.com/h2o/rel-slater/9/docs-website/h2o-docs/booklets/DeepLearning_Vignette.pdf&#34;&gt;H2O&lt;/a&gt;, have a look at the post &lt;a href=&#34;http://www.rblog.uni-freiburg.de/2017/02/07/deep-learning-in-r/&#34;&gt;Deep Learning in R&lt;/a&gt; by Kutkina and Feuerriegel.&lt;/p&gt;
&lt;p&gt;However, if you are under the impression that getting a handle on Deep Learning will get you totally up to speed with neural network buzzwords, you may be disappointed. &lt;a href=&#34;https://en.wikipedia.org/wiki/Extreme_learning_machine&#34;&gt;Extreme Learning Machines&lt;/a&gt;, which “aim to break the barriers between the conventional artificial learning techniques and biological learning mechanisms”, are sure to take you even deeper into the abyss. For a succinct introduction to ELMs with an application to handwritten digit classification, have a look at the &lt;a href=&#34;https://www.hindawi.com/journals/cin/2016/3049632/&#34;&gt;recent paper&lt;/a&gt; by Pang and Yang. For more than an afternoon’s worth of reading, browse through the IEEE Intelligent Systems issue on &lt;a href=&#34;http://www.ntu.edu.sg/home/egbhuang/pdf/IEEE-IS-ELM.pdf&#34;&gt;Extreme Learning Machines&lt;/a&gt; &lt;a href=&#34;http://www.ntu.edu.sg/home/egbhuang/&#34;&gt;here&lt;/a&gt;, and the other resources collected &lt;a href=&#34;http://www.ntu.edu.sg/home/egbhuang/pdf/IEEE-IS-ELM.pdf&#34;&gt;here&lt;/a&gt;. See the &lt;a href=&#34;http://www.ntu.edu.sg/home/egbhuang/ELM2014/index.html&#34;&gt;announcement&lt;/a&gt; of the 2014 conference for the full context of the quote above.&lt;/p&gt;
&lt;p&gt;For something a little lighter and closer to home, &lt;a href=&#34;https://christophergandrud.github.io/networkD3/&#34;&gt;Christopher Gandrud’s page&lt;/a&gt; on the &lt;a href=&#34;https://cran.r-project.org/web/packages/networkD3/index.html&#34;&gt;networkD3&lt;/a&gt; package is sure to set you browsing through &lt;a href=&#34;https://en.wikipedia.org/wiki/Sankey_diagram&#34;&gt;Sankey Diagrams&lt;/a&gt; and &lt;a href=&#34;https://cs.brown.edu/~rt/gdhandbook/chapters/force-directed.pdf&#34;&gt;Force Directed Drawing Algorithms&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(networkD3)
# Load data
data(MisLinks)
data(MisNodes)

# Plot
forceNetwork(Links = MisLinks, Nodes = MisNodes,
            Source = &amp;quot;source&amp;quot;, Target = &amp;quot;target&amp;quot;,
            Value = &amp;quot;value&amp;quot;, NodeID = &amp;quot;name&amp;quot;,
            Group = &amp;quot;group&amp;quot;, opacity = 0.8)&lt;/code&gt;&lt;/pre&gt;

&lt;div id=&#34;htmlwidget-49f7dcedcb604b5773a2&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;forceNetwork html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-49f7dcedcb604b5773a2&#34;&gt;{&#34;x&#34;:{&#34;links&#34;:{&#34;source&#34;:[1,2,3,3,4,5,6,7,8,9,11,11,11,11,12,13,14,15,17,18,18,19,19,19,20,20,20,20,21,21,21,21,21,22,22,22,22,22,22,23,23,23,23,23,23,23,23,23,24,24,25,25,25,26,26,26,26,27,27,27,27,27,28,28,29,29,29,30,31,31,31,31,32,33,33,34,34,35,35,35,36,36,36,36,37,37,37,37,37,38,38,38,38,38,38,39,40,41,41,42,42,42,43,43,43,44,44,45,47,48,48,48,48,49,49,50,50,51,51,51,52,52,53,54,54,54,55,55,55,55,55,55,55,55,55,55,56,56,57,57,57,58,58,58,58,58,59,59,59,59,60,60,60,61,61,61,61,61,61,62,62,62,62,62,62,62,62,63,63,63,63,63,63,63,63,64,64,64,64,64,64,64,64,64,64,65,65,65,65,65,65,65,65,65,65,66,66,66,66,66,66,66,66,66,67,68,68,68,68,68,68,69,69,69,69,69,69,69,70,70,70,70,70,70,70,70,71,71,71,71,71,71,71,71,72,72,72,73,74,74,75,75,75,75,75,75,75,76,76,76,76,76,76,76],&#34;target&#34;:[0,0,0,2,0,0,0,0,0,0,10,3,2,0,11,11,11,11,16,16,17,16,17,18,16,17,18,19,16,17,18,19,20,16,17,18,19,20,21,16,17,18,19,20,21,22,12,11,23,11,24,23,11,24,11,16,25,11,23,25,24,26,11,27,23,27,11,23,30,11,23,27,11,11,27,11,29,11,34,29,34,35,11,29,34,35,36,11,29,34,35,36,37,11,29,25,25,24,25,41,25,24,11,26,27,28,11,28,46,47,25,27,11,26,11,49,24,49,26,11,51,39,51,51,49,26,51,49,39,54,26,11,16,25,41,48,49,55,55,41,48,55,48,27,57,11,58,55,48,57,48,58,59,48,58,60,59,57,55,55,58,59,48,57,41,61,60,59,48,62,57,58,61,60,55,55,62,48,63,58,61,60,59,57,11,63,64,48,62,58,61,60,59,57,55,64,58,59,62,65,48,63,61,60,57,25,11,24,27,48,41,25,68,11,24,27,48,41,25,69,68,11,24,27,41,58,27,69,68,70,11,48,41,25,26,27,11,48,48,73,69,68,25,48,41,70,71,64,65,66,63,62,48,58],&#34;value&#34;:[1,8,10,6,1,1,1,1,2,1,1,3,3,5,1,1,1,1,4,4,4,4,4,4,3,3,3,4,3,3,3,3,5,3,3,3,3,4,4,3,3,3,3,4,4,4,2,9,2,7,13,1,12,4,31,1,1,17,5,5,1,1,8,1,1,1,2,1,2,3,2,1,1,2,1,3,2,3,3,2,2,2,2,1,2,2,2,2,1,2,2,2,2,2,1,1,1,2,3,2,2,1,3,1,1,3,1,2,1,2,1,1,1,3,2,1,1,9,2,2,1,1,1,2,1,1,6,12,1,1,21,19,1,2,5,4,1,1,1,1,1,7,7,6,1,4,15,5,6,2,1,4,2,2
,6,2,5,1,1,9,17,13,7,2,1,6,3,5,5,6,2,4,3,2,1,5,12,5,4,10,6,2,9,1,1,5,7,3,5,5,5,2,5,1,2,3,3,1,2,2,1,1,1,1,3,5,1,1,1,1,1,6,6,1,1,2,1,1,4,4,4,1,1,1,1,1,1,2,2,2,1,1,1,1,2,1,1,2,2,3,3,3,3,1,1,1,1,1,1,1,1,1,1,1],&#34;colour&#34;:[&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#
666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#34;,&#34;#666&#
34;,&#34;#666&#34;,&#34;#666&#34;]},&#34;nodes&#34;:{&#34;name&#34;:[&#34;Myriel&#34;,&#34;Napoleon&#34;,&#34;Mlle.Baptistine&#34;,&#34;Mme.Magloire&#34;,&#34;CountessdeLo&#34;,&#34;Geborand&#34;,&#34;Champtercier&#34;,&#34;Cravatte&#34;,&#34;Count&#34;,&#34;OldMan&#34;,&#34;Labarre&#34;,&#34;Valjean&#34;,&#34;Marguerite&#34;,&#34;Mme.deR&#34;,&#34;Isabeau&#34;,&#34;Gervais&#34;,&#34;Tholomyes&#34;,&#34;Listolier&#34;,&#34;Fameuil&#34;,&#34;Blacheville&#34;,&#34;Favourite&#34;,&#34;Dahlia&#34;,&#34;Zephine&#34;,&#34;Fantine&#34;,&#34;Mme.Thenardier&#34;,&#34;Thenardier&#34;,&#34;Cosette&#34;,&#34;Javert&#34;,&#34;Fauchelevent&#34;,&#34;Bamatabois&#34;,&#34;Perpetue&#34;,&#34;Simplice&#34;,&#34;Scaufflaire&#34;,&#34;Woman1&#34;,&#34;Judge&#34;,&#34;Champmathieu&#34;,&#34;Brevet&#34;,&#34;Chenildieu&#34;,&#34;Cochepaille&#34;,&#34;Pontmercy&#34;,&#34;Boulatruelle&#34;,&#34;Eponine&#34;,&#34;Anzelma&#34;,&#34;Woman2&#34;,&#34;MotherInnocent&#34;,&#34;Gribier&#34;,&#34;Jondrette&#34;,&#34;Mme.Burgon&#34;,&#34;Gavroche&#34;,&#34;Gillenormand&#34;,&#34;Magnon&#34;,&#34;Mlle.Gillenormand&#34;,&#34;Mme.Pontmercy&#34;,&#34;Mlle.Vaubois&#34;,&#34;Lt.Gillenormand&#34;,&#34;Marius&#34;,&#34;BaronessT&#34;,&#34;Mabeuf&#34;,&#34;Enjolras&#34;,&#34;Combeferre&#34;,&#34;Prouvaire&#34;,&#34;Feuilly&#34;,&#34;Courfeyrac&#34;,&#34;Bahorel&#34;,&#34;Bossuet&#34;,&#34;Joly&#34;,&#34;Grantaire&#34;,&#34;MotherPlutarch&#34;,&#34;Gueulemer&#34;,&#34;Babet&#34;,&#34;Claquesous&#34;,&#34;Montparnasse&#34;,&#34;Toussaint&#34;,&#34;Child1&#34;,&#34;Child2&#34;,&#34;Brujon&#34;,&#34;Mme.Hucheloup&#34;],&#34;group&#34;:[1,1,1,1,1,1,1,1,1,1,2,2,3,2,2,2,3,3,3,3,3,3,3,3,4,4,5,4,0,2,3,2,2,2,2,2,2,2,2,4,6,4,4,5,0,0,7,7,8,5,5,5,5,5,5,8,5,8,8,8,8,8,8,8,8,8,8,9,4,4,4,4,5,10,10,4,8]},&#34;options&#34;:{&#34;NodeID&#34;:&#34;name&#34;,&#34;Group&#34;:&#34;group&#34;,&#34;colourScale&#34;:&#34;d3.scale.category20()&#34;,&#34;fontSize&#34;:7,&#34;fontFamily&#34;:&#34;serif&#34;,&#34;clickTextSize&#34;:17.5,
&#34;linkDistance&#34;:50,&#34;linkWidth&#34;:&#34;function(d) { return Math.sqrt(d.value); }&#34;,&#34;charge&#34;:-120,&#34;opacity&#34;:0.8,&#34;zoom&#34;:false,&#34;legend&#34;:false,&#34;nodesize&#34;:false,&#34;radiusCalculation&#34;:&#34; Math.sqrt(d.nodesize)+6&#34;,&#34;bounded&#34;:false,&#34;opacityNoHover&#34;:0,&#34;clickAction&#34;:null}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Finally, if you are like me and think that the weekends are for catching up on things that you should probably already know, but on which you might be a bit shaky, remember that you can never know enough about GitHub. Compliments of GitHub’s Carolyn Shin, here is some online GitHub reading: &lt;a href=&#34;https://guides.github.com/&#34;&gt;GitHub Guides&lt;/a&gt;, &lt;a href=&#34;https://github.com/github/training-kit&#34;&gt;GitHub on Demand Training&lt;/a&gt;, and an online version of the &lt;a href=&#34;https://git-scm.com/book/en/v2&#34;&gt;Pro Git Book&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Reading recommendations go both ways. Please feel free to comment with some recommendations of your own.&lt;/p&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2017/03/24/some-random-weekend-reading/&#39;;&lt;/script&gt;
      </description>
    </item>
    
    <item>
      <title>Book Review: Computer Age Statistical Inference</title>
      <link>https://rviews.rstudio.com/2016/10/28/book-review-computer-age-statistical-inference/</link>
      <pubDate>Fri, 28 Oct 2016 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2016/10/28/book-review-computer-age-statistical-inference/</guid>
      <description>
        &lt;p&gt;&lt;a href=&#34;https://www.cambridge.org/core/books/computer-age-statistical-inference/E32C1911ED937D75CE159BBD21684D37&#34;&gt;&lt;em&gt;Computer Age Statistical Inference&lt;/em&gt;&lt;/a&gt;: Algorithms, Evidence and Data Science by Bradley Efron and Trevor Hastie is a brilliant read. If you are only ever going to buy one statistics book, or if you are thinking of updating your library and retiring a dozen or so dusty stats texts, this book would be an excellent choice. In 475 carefully crafted pages, Efron and Hastie examine the last 100 years or so of statistical thinking from multiple viewpoints. In the nominal approach implied by the book&amp;rsquo;s title, they describe the impact of computing on statistics, and point out where powerful computers opened up new territory. On the first page of the preface they write:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;hellip; the role of electronic computation is central to our story. This doesn&amp;rsquo;t mean that every advance was computer-related. A land bridge had opened up to a new continent but not all were eager to cross.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Empirical Bayes and James-Stein estimation, they claim, could have been discovered under the constraints of mid-twentieth-century mechanical computation, but discovering the bootstrap, proportional hazard models, large-scale hypothesis testing, and the machine learning algorithms underlying much of data science required crossing the bridge.&lt;/p&gt;

&lt;p&gt;A second path opened up in this text stops just short of the high ground of philosophy. Efron and Hastie blow by the great divide of the Bayesian versus Frequentist controversy to carefully consider the strengths and weaknesses of the three main systems of statistical inference: Frequentist, Bayesian and Fisherian Inference. You may have thought that Sir Ronald Fisher was a frequentist, but the inspired thoughts of a man of Fisher&amp;rsquo;s intellect are not so easily categorized. Efron and Hastie write:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sir Ronald Fisher was arguably the most influential anti-Bayesian of all time, but that did not make him a conventional frequentist. His key data analytic methods &amp;hellip; were almost always applied frequentistically. Their Fisherian rationale, however, often drew on ideas neither Bayesian nor frequentist in nature, or sometimes the two in combination.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Above all, in this text, Efron and Hastie are concerned with the clarity of statistical inference. They take special care to explain the nuances of Frequentist, Bayesian and Fisherian thinking, devoting early chapters to each of these conceptual frameworks. In these, they invite the reader to consider a familiar technique from either a Bayesian, Frequentist or Fisherian point of view. Then they raise issues and contrast and compare the merits of each approach. Unstated, but nagging in the back of my mind while reading these chapters, was the implication that there may, indeed, be other paths to the &amp;ldquo;science of learning from experience&amp;rdquo; (the authors&amp;rsquo; definition of statistics) that have yet to be discovered.&lt;/p&gt;

&lt;p&gt;But don&amp;rsquo;t let me mislead you into thinking that &lt;em&gt;Computer Age Statistical Inference&lt;/em&gt; is mere philosophical fluff that doesn&amp;rsquo;t really matter day-to-day. &lt;a href=&#34;https://web.stanford.edu/~hastie/CASI/contents.html&#34;&gt;Have a look at the table of contents&lt;/a&gt;. The book is organized into three parts. &amp;ldquo;Part I: Classic Statistical Inference&amp;rdquo; contains five chapters on classical statistical inference, including a gentle introduction to algorithms and inference, three chapters on the inference systems mentioned above, and a chapter on parametric models and exponential families. &amp;ldquo;Part II: Early Computer-Age Methods&amp;rdquo; has nine chapters on Empirical Bayes, James-Stein Estimation and Ridge Regression, Generalized Linear Models and Regression Trees, Survival Analysis and the EM Algorithm, The Jackknife and the Bootstrap, Bootstrap Confidence Intervals, Cross-Validation and Cp Estimates of Prediction Error, Objective Bayes Inference and MCMC, and Postwar Statistical Inference and Methodology. &amp;ldquo;Part III: Twenty-First-Century Topics&amp;rdquo; dives into the details of large-scale inference and data science, with seven chapters on Large-Scale Hypothesis Testing, Sparse Modeling and the Lasso, Random Forests and Boosting, Neural Networks and Deep Learning, Support Vector Machines and Kernel methods, Inference After Model Selection, and Empirical Bayes Estimation Strategies.&lt;/p&gt;

&lt;p&gt;Efron and Hastie will keep your feet firmly on the ground while they walk you slowly through the details, pointing out what is important, and providing the guidance necessary to keep the whole forest in mind while studying the trees. From the first page, they maintain a unified exposition of their material by presenting statistics as a tension between algorithms and inference. With this in mind, it seems plausible that there really isn&amp;rsquo;t any big disconnect between the strict logic required to think your way through the pitfalls of large-scale hypothesis testing, and the almost cavalier application of machine learning models.&lt;/p&gt;

&lt;p&gt;Nothing Efron and Hastie do throughout this entire trip is pedestrian. For example, their approach to the exponential family of distributions underlying generalized linear models doesn&amp;rsquo;t begin with the usual explanation of link functions fitting into the standard exponential family formula. Instead, they start with a Poisson family example, deriving a two-parameter general expression for the family and showing how &amp;ldquo;&lt;a href=&#34;https://stat.ethz.ch/R-manual/R-devel/library/boot/html/exp.tilt.html&#34;&gt;tilting&lt;/a&gt;&amp;rdquo; the distribution by multiplying by an exponential parameter permits the derivation of other members of the family. The example is interesting in its own right, but the payoff, which comes a couple of pages later, is an argument demonstrating how a generalization of the technique keeps the number of parameters required for inference under repeated sampling from growing without bound.&lt;/p&gt;

&lt;p&gt;A great pedagogical strength of the book is the &amp;ldquo;Notes and Details&amp;rdquo; section concluding each chapter. Here you will find derivation details, explanations of Frequentist, Bayesian and Fisherian inference, and remarks of historical significance. The Epilogue ties everything together with a historical perspective that outlines how the focus of statistical progress has shifted between Applications, Mathematics and Computation throughout the twentieth century and the early part of this century.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Computer Age Statistical Inference&lt;/em&gt; contains no code, but it is clearly an R-informed text with several plots and illustrations. The &lt;a href=&#34;https://web.stanford.edu/~hastie/CASI/data.html&#34;&gt;data sets provided on Efron&amp;rsquo;s website&lt;/a&gt; and the pseudo-code placed throughout the text are helpful for replicating much of what is described. The website points to the &lt;a href=&#34;https://cran.r-project.org/web/packages/boot/index.html&#34;&gt;boot&lt;/a&gt; and &lt;a href=&#34;https://cran.r-project.org/web/packages/bootstrap/index.html&#34;&gt;bootstrap&lt;/a&gt; packages, and provides the &lt;a href=&#34;https://web.stanford.edu/~hastie/CASI_files/COMPS/accel.R&#34;&gt;code&lt;/a&gt; for a function used in the notes to the chapter on bootstrap confidence intervals.&lt;/p&gt;
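&lt;p&gt;For readers who want to experiment alongside the text, a minimal bootstrap confidence interval with the boot package might look like this (a generic sketch of mine, not the code from the authors&amp;rsquo; website):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(boot)

# the statistic function receives the data and a vector of resampled indices
mean_stat &amp;lt;- function(data, idx) mean(data[idx])

set.seed(1)
b &amp;lt;- boot(rnorm(100), statistic = mean_stat, R = 2000)
boot.ci(b, type = &amp;quot;perc&amp;quot;)  # percentile bootstrap interval&lt;/code&gt;&lt;/pre&gt;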

&lt;p&gt;My take on &lt;em&gt;Computer Age Statistical Inference&lt;/em&gt; is that experienced statisticians will find it helpful to have such a compact summary of twentieth-century statistics, even if they occasionally disagree with the book&amp;rsquo;s emphasis; students beginning the study of statistics will value the book as a guide to statistical inference that may offset the dangerously mind-numbing experience offered by most introductory statistics textbooks; and the rest of us non-experts interested in the details will enjoy hundreds of hours of pleasurable reading.&lt;/p&gt;

&lt;p&gt;Finally, for those of you who won&amp;rsquo;t buy a book without thumbing through it, PD Dr. Pablo Emilio Verde has you covered.&lt;/p&gt;

&lt;iframe width=&#34;100%&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/Qg60kyi3ZjU&#34; frameborder=&#34;0&#34; allowfullscreen&gt;&lt;/iframe&gt;

        &lt;script&gt;window.location.href=&#39;https://rviews.rstudio.com/2016/10/28/book-review-computer-age-statistical-inference/&#39;;&lt;/script&gt;
      </description>
    </item>
    
  </channel>
</rss>
