Frank's R Workflow

by Joseph Rickert

Frank Harrell’s new eBook, R Workflow, which aims “to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting,” is an ambitious document that is notable on multiple levels.

To begin with, the workflow itself is much more than a simple progression of logical steps.

Diagram of Reproducible Research Workflow

This workflow is clearly the result of a process forged through trial and error by a master statistician over many years. As the diagram indicates, the document takes a holistic view of statistical analysis, covering document preparation, data manipulation, statistical practice, computational concerns, and more.

Then, there is the synthesis of a wide range of content into a succinct, very readable exposition that dips into some very deep topics. Frank’s examples are streamlined presentations of analyses and code that are both sophisticated and practical. The missing value section suggests a whole array of analyses through a careful presentation of plots, and the section on data checking introduces a level of automation beyond what is commonly done.

Frank’s writing style is clear and informal, and it comes from the perspective of a teacher who wants to show you some cool things along with the basics. For example, don’t miss the if Trick in section 2.4.3.

I should mention that Frank’s eBook is not a tidyverse presentation. The code examples are built around base R, Frank’s Hmisc and rms packages, and an eclectic mix of packages that includes data.table, plotly, and the tidyverse packages haven and ggplot2. In a way, this selection of packages reflects the evolution of R itself. For example, as with many popular R packages, Hmisc most likely started out as Frank’s personal tool kit. However, after many years of Frank’s deep commitment to using R and contributing R tools, which includes seventy versions of Hmisc in nineteen years, the package has become a fundamental resource. (Have a look at the reverse depends, imports, and suggests.) Also, the mix of packages with different design philosophies underlying R Workflow reflects the flexibility of the R language and the organic growth of the R ecosystem.

Perhaps the most striking aspect of the eBook is the way Frank uses Quarto, knitr and Hmisc to build an elegant reproducible document about building reproducible documents. For example, Quarto permits the effective placement of plots in the right margins of the document, and the Quarto callouts in Section 3.4 enable the mini tutorials that include Special Considerations for Latex/pdf and Using Tooltips with Mermaid to be embedded in the document without interrupting its flow. Moreover, along with functions like Hmisc::getHdata() and Hmisc::getRs(), Quarto enables the document to achieve a high level of reproducibility by pulling data and code directly from GitHub repositories.
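To give a flavor of what pulling data directly into a document looks like, here is a minimal sketch using Hmisc::getHdata(). It is not taken from the eBook. It assumes that Hmisc is installed, that an internet connection is available, and that a dataset named prostate is present in the remote repository that getHdata() reads from by default. The dataset name was picked only for illustration.

```r
# Minimal sketch, not code from R Workflow.
# Assumptions: Hmisc is installed, the machine is online, and a dataset
# named 'prostate' exists in the repository getHdata() uses by default.
library(Hmisc)

# Download the dataset by (unquoted) name; it is created as an object
# called 'prostate' in the global environment.
getHdata(prostate)

# Show variable names, labels, units, and storage information.
contents(prostate)

# Produce a concise univariate summary of every variable.
describe(prostate)
```

Because the data are fetched by name at render time, a chunk like this can sit inside a Quarto document without requiring the reader to download any files by hand.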

Not only can Frank’s R Workflow teach you some serious statistics, but studying its construction will take you a long way towards building aesthetically pleasing reproducible documents.

Frank Harrell will be delivering a keynote address on August 26th at the upcoming R/Medicine conference.

