R for Quantitative Health Sciences: An Interview with Jarrod Dalton

2019-02-06

by Joseph Rickert

This interview came about through researching R-based medical applications in preparation for the upcoming R/Medicine conference. When we discovered the impressive number of Shiny-based Risk Calculators developed by the Cleveland Clinic and implemented in public-facing sites, we wanted to learn more about the influence of R Language in the development of statistical science at this prominent institution. We were fortunate to have Jarrod Dalton of the Quantitative Health Sciences Department grant this interview.

Jarrod Dalton, PhD is an assistant staff scientist in the Department of Quantitative Health Sciences and an assistant professor of medicine in the Cleveland Clinic Lerner College of Medicine at Case Western Reserve University in Cleveland, Ohio. (Twitter: @daltonjarrod)

JBR: QHS has been a leader in medical statistical research for quite some time. Have recent developments in big data, data science and machine learning changed the nature of your work? Are these trends bringing new tools and new challenges to QHS projects?

JD: Yes and no. On one hand, the health care sector has been at least as dynamic over the past 10-20 years as the fields of data science and machine learning. Clinical and biological research is quite diverse, and these trends have only added to its diversity. We work on studies with p>>n, n>>p, and everything in between. The complexity of problems in biomedical research has spurred new methodological innovations by our department, such as random survival forests. Modern machine learning algorithms are amenable to certain types of problems; I’d say that the most impactful manifestations to date of machine learning in medicine are in the fields of radiology and genomics, where diagnostic and prognostic problems are well-defined.

On the other hand, there are unique challenges relating to the application of machine learning algorithms in medicine. Doctors and patients are averse to trusting a model if they don’t understand how and why the prediction is being made. Doctors often have justifiable and either unquantified or unquantifiable reasons for disbelieving predictions on the basis of other clinical information they obtain during the process of care. Predictions inform treatment decisions (or decisions not to administer treatments), and many other issues are involved with optimal clinical decision-making (e.g., physician judgment, patient preferences, cost effectiveness of different therapies, discounting rate for health events that are distant in the future, trade-offs between quality of life and longevity, issues relating to health literacy, numeracy and the communication of risk, and desired degree of participation in the decision-making process on behalf of the patient).

Much of our work has, and will continue to be, in the clinical trials space, as well as good-old biostatistical consulting. We have a department of over 100 people and publish over 400 academic research studies every year, the vast majority of which arising from traditional statistical collaboration.

JBR: Ten years ago, a typical medical statistics department might consist of a number of Ph.D. statisticians who did almost no coding supported by a legion of SAS programmers. Have open-source languages such as R and Python changed the way work gets done? Do more statisticians now do their own coding? Do you see a movement away from SAS towards open source tools? Do you see clinicians doing their own analyses with R?

JD: We have dedicated consulting teams that are embedded within some of the more research-intensive clinical specialties at our institution, such as heart and vascular, oncology, urology, anesthesiology, neurology, and orthopaedics. Another consulting team, which we call the “alpha-beta team”, is composed of statisticians who allocate their time to the smaller sub-specialties. More recently, we have been successful at establishing externally funded research labs, headed by QHS principal investigators. Each of these teams has their own way of doing things. We are supported by a number of dedicated RStudio servers, as well as a high-performance computing cluster with R and Python capabilities. On the SAS front, our institution has a high-performance Enterprise Miner environment to support both research and business intelligence. All this having been said, roughly speaking, about half of our department uses R and half uses SAS. Some of our researchers in genomics and image analytics use Python, or complex pipelines that incorporate Python, R, and other tools.

I personally have used R since 2002. I have seen the power of open-source software, with R constantly reinventing itself in a variety of ways. I can’t believe that ggplot and plyr are 10 years old. The tidyverse has changed the way I think. This is especially so for the dplyr and purrr packages, which have enabled much greater efficiency and transparency. My team has recently taken advantage of distributed database computing via a dbplyr/Teradata Warehouse stack, using electronic health data from 2.7 million patients.

More and more physician scientists are training with R. Our partner university, Case Western Reserve, has a two-course sequence on data science based in R, and that sequence is a component of several Master’s programs that many of our clinicians pursue. Some of them can code up a storm! Others know just enough to be dangerous.

JBR: The number of Shiny-based Risk Calculators implemented on your website is astounding, both in their level of sophistication and in the number of topics covered. What is your goal for this project, and how would you like the calculators to be used? Can you say something about the challenges (both medical and technical) you faced in building these calculators?

JD: The goal of the project is to inform clinicians as to our best estimates of predicted outcomes. These predictions have been shown in several studies to be more accurate than clinical judgment or crude decision trees. Ultimately, these more accurate predictions should translate into better medical decision-making, especially with regard to treatment selection. The major challenge occurs up front: working with the clinician to clearly articulate the prediction that is needed – that which would be most hopeful for prospective decision-making. Modeling usually goes very well except when the outcome of interest is very rare: those models often turn out to be not very useful clinically because they never predict a high probability for the rare outcome.

From a technical perspective, the challenge is how to make our data insights and predictive models available online. Before R shiny, our RiskCalc team had tried several web platforms and were not satisfied. We have some very sophisticated models and those platforms either do not support complex computing algorithms or require a lot of programming effort. Using R shiny makes the process of converting our models into web applications quick and easy. The next steps for our RiskCalc team are to improve the user interface and collect feedback from the clinicians.

JBR: How important is reproducible research to QHS, and what role does R play in building reproducible workflows?

We have seen a steady progression toward reproducible research practices in medicine. All clinical trial protocols must now be pre-registered, and proof of adherence to the pre-registered protocol is now a standard requirement of many of the top journals. Somewhat controversially, there has recently been a lot of discussion about a “replication crisis”, particularly in the psychological sciences (but perhaps unfairly so). In any case, the increased focus on replicability has led to an increased need for reproducible research practices.

Nik Krieger and I have recently made an R package, called projects, that is specifically designed for reproducible academic manuscript development workflows. While the projects package has other features - like the ability to develop and maintain a coauthor database, complete with institutional affiliations, or the ability to automatically generate title pages for manuscripts using the authors’ institutional affiliations - its core functionality is establishing project directories with Markdown templates corresponding to each phase of the academic research pipeline (protocol, data wrangling, analysis, and reporting). Project metadata are stored in a tibble, so that teams can prioritize and strategize directly from the R console.

JBR: Do you have any additional thoughts about the use of R in Medicine that you would like to share with our readers?

What comes to mind are current challenges in integrating R into production environments in medicine, such as the electronic health record (EHR). EHR systems are not open-source, and there are many vendors. Even for a single vendor, implementations at different institutions may look wildly different from a data perspective. Our EHR system has over 10,000 tables. There are so many challenges to implementing anything in the healthcare space. That may sound pessimistic, but I actually intend to communicate the significant number of opportunities for using R to positively influence the health of populations. We have fantastic clinical partners and champions. We’re always getting better. The work is important and rewarding.