R In Medicine on R Views

R for Quantitative Health Sciences: An Interview with Jarrod Dalton
https://rviews.rstudio.com/2019/02/06/r-for-quantitative-health-sciences-an-interview-with-jarrod-dalton/
Wed, 06 Feb 2019
<p>This interview came about through researching R-based medical applications in preparation for the upcoming <a href="https://r-medicine.com/">R/Medicine</a> conference. When we discovered the impressive number of Shiny-based <a href="http://riskcalc.org:3838/">Risk Calculators</a> developed by the <a href="https://my.clevelandclinic.org/">Cleveland Clinic</a> and implemented in public-facing sites, we wanted to learn more about the influence of R Language in the development of statistical science at this prominent institution. We were fortunate to have Jarrod Dalton of the <a href="https://www.lerner.ccf.org/qhs/">Quantitative Health Sciences</a> Department grant this interview.</p>
<p>Jarrod Dalton, PhD is an assistant staff scientist in the Department of Quantitative Health Sciences and an assistant professor of medicine in the Cleveland Clinic Lerner College of Medicine at Case Western Reserve University in Cleveland, Ohio. (Twitter: @daltonjarrod)</p>
<p><em>JBR: QHS has been a leader in medical statistical research for quite some time. Have recent developments in big data, data science and machine learning changed the nature of your work? Are these trends bringing new tools and new challenges to QHS projects?</em></p>
<p>JD: Yes and no. On one hand, the health care sector has been at least as dynamic over the past 10-20 years as the fields of data science and machine learning. Clinical and biological research is quite diverse, and these trends have only added to its diversity. We work on studies with p>>n, n>>p, and everything in between. The complexity of problems in biomedical research has spurred new methodological innovations by our department, such as random survival forests. Modern machine learning algorithms are amenable to certain types of problems; I’d say that the most impactful manifestations to date of machine learning in medicine are in the fields of radiology and genomics, where diagnostic and prognostic problems are well-defined.</p>
<p>On the other hand, there are unique challenges relating to the application of machine learning algorithms in medicine. Doctors and patients are averse to trusting a model if they don’t understand how and why the prediction is being made. Doctors often have justifiable and either unquantified or unquantifiable reasons for disbelieving predictions on the basis of other clinical information they obtain during the process of care. Predictions inform treatment decisions (or decisions not to administer treatments), and many other issues are involved with optimal clinical decision-making (e.g., physician judgment, patient preferences, cost effectiveness of different therapies, discounting rate for health events that are distant in the future, trade-offs between quality of life and longevity, issues relating to health literacy, numeracy and the communication of risk, and desired degree of participation in the decision-making process on behalf of the patient).</p>
<p>Much of our work has been, and will continue to be, in the clinical trials space, as well as good old-fashioned biostatistical consulting. We have a department of over 100 people and publish over 400 academic research studies every year, the vast majority of which arise from traditional statistical collaboration.</p>
<p><em>JBR: Ten years ago, a typical medical statistics department might consist of a number of Ph.D. statisticians who did almost no coding supported by a legion of SAS programmers. Have open-source languages such as R and Python changed the way work gets done? Do more statisticians now do their own coding? Do you see a movement away from SAS towards open source tools? Do you see clinicians doing their own analyses with R?</em></p>
<p>JD: We have dedicated consulting teams that are embedded within some of the more research-intensive clinical specialties at our institution, such as heart and vascular, oncology, urology, anesthesiology, neurology, and orthopaedics. Another consulting team, which we call the “alpha-beta team”, is composed of statisticians who allocate their time to the smaller sub-specialties. More recently, we have been successful at establishing externally funded research labs, headed by QHS principal investigators. Each of these teams has their own way of doing things. We are supported by a number of dedicated RStudio servers, as well as a high-performance computing cluster with R and Python capabilities. On the SAS front, our institution has a high-performance Enterprise Miner environment to support both research and business intelligence. All this having been said, roughly speaking, about half of our department uses R and half uses SAS. Some of our researchers in genomics and image analytics use Python, or complex pipelines that incorporate Python, R, and other tools.</p>
<p>I personally have used R since 2002. I have seen the power of open-source software, with R constantly reinventing itself in a variety of ways. I can’t believe that <code>ggplot</code> and <code>plyr</code> are 10 years old. The tidyverse has changed the way I think. This is especially so for the <code>dplyr</code> and <code>purrr</code> packages, which have enabled much greater efficiency and transparency. My team has recently taken advantage of distributed database computing via a <code>dbplyr</code>/Teradata Warehouse stack, using electronic health data from 2.7 million patients.</p>
<p>More and more physician scientists are training with R. Our partner university, Case Western Reserve, has a two-course sequence on data science based in R, and that sequence is a component of several Master’s programs that many of our clinicians pursue. Some of them can code up a storm! Others know just enough to be dangerous.</p>
<p><em>JBR: The number of Shiny-based <a href="http://riskcalc.org:3838/">Risk Calculators</a> implemented on your website is astounding, both in their level of sophistication and in the number of topics covered. What is your goal for this project, and how would you like the calculators to be used? Can you say something about the challenges (both medical and technical) you faced in building these calculators?</em></p>
<p>JD: The goal of the project is to inform clinicians as to our best estimates of predicted outcomes. These predictions have been shown in several studies to be more accurate than clinical judgment or crude decision trees. Ultimately, these more accurate predictions should translate into better medical decision-making, especially with regard to treatment selection. The major challenge occurs up front: working with the clinician to clearly articulate the prediction that is needed – that which would be most helpful for prospective decision-making. Modeling usually goes very well except when the outcome of interest is very rare: those models often turn out to be not very useful clinically because they never predict a high probability for the rare outcome.</p>
<p>From a technical perspective, the challenge is how to make our data insights and predictive models available online. Before Shiny, our RiskCalc team had tried several web platforms and was not satisfied with any of them. We have some very sophisticated models, and those platforms either do not support complex computing algorithms or require a lot of programming effort. Using Shiny makes the process of converting our models into web applications quick and easy. The next steps for our RiskCalc team are to improve the user interface and collect feedback from the clinicians.</p>
<p><em>JBR: How important is reproducible research to QHS, and what role does R play in building reproducible workflows?</em></p>
<p>JD: We have seen a steady progression toward reproducible research practices in medicine. All clinical trial protocols must now be pre-registered, and proof of adherence to the pre-registered protocol is now a standard requirement of many of the top journals. Somewhat controversially, there has recently been a lot of discussion about a “replication crisis”, particularly in the psychological sciences (though perhaps unfairly so). In any case, the increased focus on replicability has led to an increased need for reproducible research practices.</p>
<p>Nik Krieger and I have recently made an R package, called <a href="https://github.com/NikKrieger/projects"><code>projects</code></a>, that is specifically designed for reproducible academic manuscript development workflows. While the projects package has other features - like the ability to develop and maintain a coauthor database, complete with institutional affiliations, or the ability to automatically generate title pages for manuscripts using the authors’ institutional affiliations - its core functionality is establishing project directories with Markdown templates corresponding to each phase of the academic research pipeline (protocol, data wrangling, analysis, and reporting). Project metadata are stored in a tibble, so that teams can prioritize and strategize directly from the R console.</p>
<p><em>JBR: Do you have any additional thoughts about the use of R in Medicine that you would like to share with our readers?</em></p>
<p>JD: What comes to mind are the current challenges of integrating R into production environments in medicine, such as the electronic health record (EHR). EHR systems are not open source, and there are many vendors. Even for a single vendor, implementations at different institutions may look wildly different from a data perspective. Our EHR system has over 10,000 tables. There are many challenges to implementing anything in the health care space. That may sound pessimistic, but what I really mean to convey is the significant number of opportunities for using R to positively influence the health of populations. We have fantastic clinical partners and champions. We’re always getting better. The work is important and rewarding.</p>
Statistics in Glaucoma: Part III
https://rviews.rstudio.com/2018/12/18/statistics-in-glaucoma-part-iii/
Tue, 18 Dec 2018
<p><em>Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and in Duke Forge, Duke’s Center for Actionable Health Data Science.</em></p>
<p><em>Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.</em></p>
<div id="looking-forward-in-glaucoma-progression-research" class="section level2">
<h2>Looking Forward in Glaucoma Progression Research</h2>
<p>The contribution of the <code>womblR</code> package and corresponding statistical methodology is a technique for correctly accounting for the complex spatial structure of the visual field. The purpose of this method is to properly model visual field data, so that an effective diagnostic for discriminating progression status can be derived. This is one of many important clinical questions that need to be addressed in glaucoma research. Others include: quantifying visual field variability to create simulations of healthy and progressing patients, combining multiple data modalities to obtain a composite diagnostic, and predicting the timing and spatial location of future vision loss. There is opportunity within the glaucoma literature for the development of quantitative methods that answer important clinical questions, are easy to understand, and are simple to use. To this end, in closing this three-part series, we present a final example of a new method that uses change points to assess future vision loss.</p>
</div>
<div id="modeling-changes-on-the-visual-field-using-spatial-change-points" class="section level2">
<h2>Modeling Changes on the Visual Field Using Spatial Change Points</h2>
<p>To motivate the use of change points, we note that patients diagnosed with glaucoma are often monitored for years with slow changes in visual functionality. It is not until disease progression that notable vision loss occurs, and the deterioration is often swift. This disease course inspires a modeling framework that can identify a point of functional change during follow-up; thus, change points are employed. This framework is appealing because the concept of disease progression becomes intrinsically parameterized in the model, with the change point representing the point of functional change. In this model, the time of the change point triggers a simultaneous change in both the mean and variance processes. Furthermore, to account for the typical trend of a long period of monitoring with little change followed by abrupt vision loss, we force the mean and variance to be constant before the change point. For the mean process, and assuming data from a patient with nine visual field tests, this results in <span class="math display">\[\mu_t\left(\mathbf{s}_i\right)=\left\{ \begin{array}{ll}
{\beta}_0\left(\mathbf{s}_i\right) & x_t \leq \theta\left(\mathbf{s}_i\right),\\
{\beta}_0\left(\mathbf{s}_i\right) + {\beta}_1\left(\mathbf{s}_i\right)\left\{x_t-\theta\left(\mathbf{s}_i\right)\right\} & \theta\left(\mathbf{s}_i\right) \leq x_t.\end{array} \right. \quad t = 1,\ldots,9 \quad i = 1,\ldots,52 \]</span> Here, the change point at location <span class="math inline">\(\mathbf{s}_i\)</span> is given by <span class="math inline">\(\theta(\mathbf{s}_i)\)</span>, and <span class="math inline">\(x_t\)</span> is the days from baseline visit for follow-up visit <span class="math inline">\(t\)</span>. A final important detail is that the change points <span class="math inline">\(\theta(\mathbf{s}_i)\)</span> are truncated in the observed follow-up range, <span class="math inline">\((x_1, x_9)\)</span>. In practice, the true change point can occur outside of this region, so we define a latent process, <span class="math inline">\(\eta(\mathbf{s}_i)\)</span>, that defines the true change point, <span class="math inline">\(\theta(\mathbf{s}_i) = \max\{\min\{\eta(\mathbf{s}_i), x_9\}, x_1\}\)</span>. Finally, all of the location-specific effects are modeled using a novel multivariate conditional autoregressive (MCAR) prior that incorporates the anatomy detailed in the <code>womblR</code> method. More details can be found in Berchuck et al. 2018.</p>
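<p>The broken-stick mean above, and the clamping of the latent change point, translate directly into code. Here is a minimal base-R sketch; the parameter values (a baseline of 28 dB, a slope of -6 dB per year, a change point at one year) are arbitrary illustrations, not estimates from the paper.</p>

```r
# Piecewise (broken-stick) mean: constant before the change point, linear after
mu <- function(x, beta0, beta1, theta) beta0 + beta1 * pmax(x - theta, 0)

x <- seq(0, 2.5, by = 0.5)                 # follow-up times on the year scale
mu(x, beta0 = 28, beta1 = -6, theta = 1)   # flat at 28, then declining

# Truncation of the latent change point: theta = max(min(eta, x_9), x_1)
clamp <- function(eta, x1, x9) pmax(pmin(eta, x9), x1)
clamp(c(-0.3, 1.2, 4.0), x1 = 0, x9 = 2.5) # change points pulled into (x_1, x_9)
```

<p>Change points estimated at exactly <code>x1</code> or <code>x9</code> therefore correspond to latent change points that may lie outside the observed follow-up window.</p>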
<p>We once again rely on MCMC methods for inference, and a package similar to <code>womblR</code> was developed that implements the spatially varying change point model, <code>spCP</code>. This package has much of the same functionality as <code>womblR</code>, and we demonstrate its functionality below.</p>
<p>We begin by loading <code>spCP</code>. All of the visual field data (<code>VFSeries</code>), adjacencies (<code>HFAII_Queen</code>), and anatomical angles (<code>GarwayHeath</code>) are included in the <code>spCP</code> package.</p>
<pre class="r"><code>###Load package
library(spCP)
###Format data
blind_spot <- c(26, 35) # define blind spot
VFSeries <- VFSeries[order(VFSeries$Location), ] # sort by location
VFSeries <- VFSeries[order(VFSeries$Visit), ] # sort by visit
VFSeries <- VFSeries[!VFSeries$Location %in% blind_spot, ] # remove blind spot locations
Y <- VFSeries$DLS # define observed outcome data
Time <- unique(VFSeries$Time) / 365 # years since baseline visit
MaxTime <- max(Time)
###Neighborhood objects
W <- HFAII_Queen[-blind_spot, -blind_spot] # visual field adjacency matrix
M <- dim(W)[1] # number of locations
DM <- GarwayHeath[-blind_spot] # Garway-Heath angles
Nu <- length(Time) # number of visits
###Obtain bounds for spatial parameter (details are in Berchuck et al. 2018)
pdist <- function(x, y) pmin(abs(x - y), (360 - pmax(x, y) + pmin(x, y))) #Dissimilarity metric distance function (i.e., circumference)
DM_Matrix <- matrix(nrow = M, ncol = M)
for (i in 1:M) {
for (j in 1:M) {
DM_Matrix[i, j] <- pdist(DM[i], DM[j])
}
}
BAlpha <- -log(0.5) / min(DM_Matrix[DM_Matrix > 0])
AAlpha <- 0
###Hyperparameters
Hypers <- list(Alpha = list(AAlpha = AAlpha, BAlpha = BAlpha),
Sigma = list(Xi = 6, Psi = diag(5)),
Delta = list(Kappa2 = 1000))
###Starting values
Starting <- list(Sigma = 0.01 * diag(5),
Alpha = mean(c(AAlpha, BAlpha)),
Delta = c(0, 0, 0, 0, 0))
###Metropolis tuning variances
Tuning <- list(Lambda0Vec = rep(1, M),
Lambda1Vec = rep(1, M),
EtaVec = rep(1, M),
Alpha = 1)
###MCMC inputs
MCMC <- list(NBurn = 10000, NSims = 250000, NThin = 25, NPilot = 20)</code></pre>
<p>Once the inputs have been properly formatted, the program can be run.</p>
<pre class="r"><code>###Run MCMC sampler
reg.spCP <- spCP(Y = Y, DM = DM, W = W, Time = Time,
Starting = Starting, Hypers = Hypers, Tuning = Tuning, MCMC = MCMC,
Family = "tobit",
Weights = "continuous",
Distance = "circumference",
Rho = 0.99,
ScaleY = 10,
ScaleDM = 100,
Seed = 54)
## Burn-in progress: |*************************************************|
## Sampler progress: 0%.. 10%.. 20%.. 30%.. 40%.. 50%.. 60%.. 70%.. 80%.. 90%.. 100%.. </code></pre>
<p>To visualize the estimated change points, we can use the <code>PlotCP</code> function from <code>spCP</code>. The function requires the model fit object and the original data set, plus the variable names of the raw DLS, time (in years), and spatial locations.</p>
<pre class="r"><code>VFSeries$TimeYears <- VFSeries$Time / 365
PlotCP(reg.spCP,
VFSeries,
dls = "DLS",
time = "TimeYears",
location = "Location",
cp.line = TRUE,
cp.ci = TRUE)</code></pre>
<p><img src="/post/2018-12-12-statistics-in-glaucoma-part-iii_files/figure-html/unnamed-chunk-4-1.png" width="689.28" style="display: block; margin: auto;" /></p>
<p>Using the <code>PlotCP</code> function, we present the posterior means of the change points using a blue vertical line, with dashed 95% credible intervals. Furthermore, the mean process and credible interval are plotted using red lines, and the raw DLS values are given by black points. For this example patient, the majority of the change points are at the edges of follow-up. When the DLS is constant over time, the estimated change points are at the end of follow-up, while any trends already present at the start of follow-up yield change points estimated at the beginning. This information provides clinicians with visual and quantitative confirmation of functional changes across the visual field.</p>
</div>
<div id="change-points-as-a-proxy-for-progression" class="section level2">
<h2>Change Points as a Proxy for Progression</h2>
<p>To formalize the importance of the change points, we look to convert their presence or absence into a clinical decision. We decide to calculate the probability that a change point has been observed at each location across the visual field. To provide a tool that is useful for clinicians, we create a gif that presents the probability of a change point throughout a patient’s follow-up, and are able to predict one and a half years into the future. In Berchuck et al. 2018, it is shown that these change points are highly predictive of progression.</p>
<p>We begin by extracting and calculating the change point probabilities.</p>
<pre class="r"><code>###Extract change point posterior samples
eta <- reg.spCP$eta
###Convert change points to probabilities of occurring before time t
NFrames <- 50 # number of frames in GIF
GIF_Times <- seq(0, MaxTime + 1.5, length.out = NFrames) # extend the GIF to 1.5 years past the end of follow-up
GIF_Days <- round(GIF_Times * 365) # convert to days for use later
CP_Probs <- matrix(nrow = M, ncol = NFrames)
###Obtain probabilities at each time point
for (t in 1:NFrames) {
CP_Probs[, t] <- apply(eta, 2, function(x) mean(x < GIF_Times[t]))
}
colnames(CP_Probs) <- GIF_Times</code></pre>
<p>Now, to create a gif of the probabilities, we use the <code>magick</code> package, and in particular, the functions <code>image_graph</code> and <code>image_animate</code>. Furthermore, we use the <code>PlotSensitivity</code> function from <code>womblR</code> to plot the predicted probabilities on the visual field.</p>
<pre class="r"><code>###Load packages
library(magick) # package for creating GIFs
library(womblR) # loaded for PlotSensitivity
###Create GIF
Img <- image_graph(600, 600, res = 96)
for (f in 1:NFrames) {
p <- womblR::PlotSensitivity(CP_Probs[, f],
legend.lab = expression(paste("Pr[", eta, "(s) < t]")),
zlim = c(0, 1),
bins = 250,
legend.round = 2,
border = FALSE,
main = bquote("Days from baseline: t = " ~ .(GIF_Days[f]) ~ " (" ~ t[max] ~ " = " ~ .(Time[Nu] * 365) ~ ")"))
}
dev.off()</code></pre>
<pre><code>## quartz_off_screen
## 2</code></pre>
<p>Now, we animate and print the created gif using the <code>image_animate</code> function, specifying 10 frames per second using the <code>fps</code> option.</p>
<pre class="r"><code>Animation <- image_animate(Img, fps = 10)
Animation</code></pre>
<p><img src="/post/2018-12-12-statistics-in-glaucoma-part-iii_files/figure-html/unnamed-chunk-7-1.gif" style="display: block; margin: auto;" /></p>
<p>This gif has many properties that make it clinically useful. Its space-time nature allows clinicians to understand not only the current state of the disease, but also the progression pattern throughout all of follow-up. Furthermore, the gif shows the pattern and future risk of progression over the next one and a half years, giving clinicians a tool for planning around future risk.</p>
</div>
<div id="conclusions-and-future-directions" class="section level2">
<h2>Conclusions and Future Directions</h2>
<p>The hope in developing these <code>R</code> packages is for them to be used clinically, and to inspire other quantitative scientists to do the same. Statistical methods developed for medical research are typically published without any corresponding software package. This means that no matter how impactful a method may be, it is unlikely to make a clinical impact for many years, due to the complexity of implementing the inferential methods. Clinicians depend on quantitative methods for analyzing the massive amounts of data that exist in today’s world, and they typically rely on the proprietary software built into the imaging machines themselves. This software is useful, but because the underlying methods are often not published, it can be difficult to interpret the results. More open-source software developed for medical research will lead to greater collaboration and visibility for the important problems being addressed by health researchers. The <code>R</code> environment, including CRAN and RStudio, makes it particularly easy to create and share <code>R</code> packages, and the development of <code>Rcpp</code> and its relatives allows packages to be computationally fast. Our hope is that the <code>womblR</code> and <code>spCP</code> packages illustrate this concept and excite people to get involved in glaucoma research, or one of the many other important health areas.</p>
</div>
<div id="reference" class="section level2">
<h2>Reference</h2>
<ol style="list-style-type: decimal">
<li>Berchuck, S.I., Mwanza, J.C., & Warren, J.L. (2018). <a href="https://arxiv.org/pdf/1811.11038.pdf">“A Spatially Varying Change Points Model for Monitoring Glaucoma Progression Using Visual Field Data”</a>.</li>
</ol>
</div>
Statistics in Glaucoma: Part II
https://rviews.rstudio.com/2018/12/07/statistics-in-glaucoma-part-ii/
Fri, 07 Dec 2018
<p><em>Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and in Duke Forge, Duke’s Center for Actionable Health Data Science.</em></p>
<p><em>Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.</em></p>
<div id="analyzing-visual-field-data" class="section level2">
<h2>Analyzing Visual Field Data</h2>
<p>In Part I of this series on statistics in glaucoma, we detailed the use of visual fields for understanding functional vision loss in glaucoma patients. Before discussing a new method for modeling visual field data that accounts for the anatomy of the eye, we discussed how visual field data are typically analyzed by introducing a common diagnostic metric, point-wise linear regression (PLR). PLR is a trend-based diagnostic that uses slope p-values from location-specific linear regressions to discriminate progression status. The motivation for PLR is straightforward: large negative slopes at numerous visual field locations are indicative of progression. PLR is characteristic of a large class of methods for analyzing visual field data that attempt to discriminate progression based on changes in the DLS across time. This technique is simple, intuitive, and effective; however, it is often limited by the naivete of its modeling assumptions, including the assumed independence of visual field locations.</p>
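<p>To make the mechanics concrete, PLR can be sketched in a few lines of base R: fit a separate regression of DLS on time at each location and flag locations whose slopes are significantly negative. The simulated data, the number of locations, and the 0.05 cutoff below are illustrative assumptions, not values from the clinical literature.</p>

```r
# Point-wise linear regression (PLR) sketch on simulated visual field data
set.seed(1)
n_visits <- 9
n_locs <- 52
time <- seq(0, 2.5, length.out = n_visits)              # years since baseline
dls <- matrix(30 + rnorm(n_visits * n_locs, sd = 1.5),  # stable locations...
              nrow = n_visits, ncol = n_locs)
dls[, 1:5] <- dls[, 1:5] - 4 * time                     # ...plus 5 progressing ones

# One-sided slope p-value (H1: slope < 0) at each location
plr <- apply(dls, 2, function(y) {
  coefs <- summary(lm(y ~ time))$coefficients["time", ]
  c(slope = unname(coefs["Estimate"]),
    p = pt(unname(coefs["t value"]), df = n_visits - 2))
})
flagged <- which(plr["p", ] < 0.05)   # candidate progressing locations
```

<p>Real PLR criteria are more elaborate (for example, requiring several flagged locations to be confirmed on repeat testing), but the per-location regressions above are the core computation.</p>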
</div>
<div id="ocular-anatomy-in-the-neighborhood-structure-of-the-visual-field" class="section level2">
<h2>Ocular Anatomy in the Neighborhood Structure of the Visual Field</h2>
<p>To properly account for the spatial dependencies on the visual field, Berchuck et al. 2018 introduce a neighborhood model that incorporates anatomical information through a dissimilarity metric. Details of the method can be found in Berchuck et al. 2018, but we provide a quick introduction. The key development is the specification of the neighborhood structure through a new definition of adjacency weights. Typically in areal data, the adjacency for two locations <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> is defined as <span class="math inline">\(w_{ij} = 1(i \sim j)\)</span>, where <span class="math inline">\(i \sim j\)</span> is the event that locations <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span> are neighbors. As discussed in Part I, this assumption is not sufficient due to the complex anatomy of the eye. To account for this additional structure, a more general adjacency is introduced that is a function of a dissimilarity metric, <span class="math inline">\(w_{ij}(\alpha_t) = 1(i \sim j)\exp\{-z_{ij}\alpha_t\}\)</span>. Here, <span class="math inline">\(z_{ij}\)</span> is a dissimilarity metric that represents the absolute difference between the Garway-Heath angles of locations <span class="math inline">\(i\)</span> and <span class="math inline">\(j\)</span>.</p>
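<p>A minimal base-R sketch of this weighting scheme follows; the angles, the neighbor structure, and the value of <span class="math inline">\(\alpha\)</span> are arbitrary illustrations, not values from the paper.</p>

```r
# Dissimilarity-based adjacency: w_ij(alpha) = 1(i ~ j) * exp(-z_ij * alpha)
angles <- c(60, 75, 120, 250)            # hypothetical Garway-Heath angles (degrees)
adj <- matrix(c(0, 1, 1, 0,              # binary neighbor indicator 1(i ~ j)
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)
z <- abs(outer(angles, angles, "-"))     # dissimilarity: absolute angle difference
alpha <- 0.05                            # weight placed on the dissimilarity metric
w <- adj * exp(-z * alpha)               # weighted adjacency matrix
round(w, 3)
```

<p>As <span class="math inline">\(\alpha\)</span> grows, the off-diagonal weights shrink toward zero (an independence model), while <span class="math inline">\(\alpha = 0\)</span> recovers the usual binary adjacency.</p>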
<p>The parameter <span class="math inline">\(\alpha_t\)</span> dictates the importance of the dissimilarity metric at each visual field exam <span class="math inline">\(t\)</span>. When <span class="math inline">\(\alpha_t\)</span> becomes large, the model reduces to an independent process, and as <span class="math inline">\(\alpha_t\)</span> goes to zero, the process becomes the standard spatial model for areal data. Based on the specification of the adjacency weights, <span class="math inline">\(\alpha_t\)</span> has a useful interpretation with respect to deterioration of visual ability. In particular, <span class="math inline">\(\alpha_t\)</span> changing over exams indicates that the neighborhood structure on the visual field is changing, which in turn implies damage to the underlying retinal ganglion cell structure. This observation motivates a diagnostic of progression that quantifies variability in <span class="math inline">\(\alpha_t\)</span> across time. We choose the coefficient of variation (CV) and demonstrate that it is a highly significant predictor of progression and, furthermore, independent of trend-based methods such as PLR.</p>
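<p>Once posterior estimates of the <span class="math inline">\(\alpha_t\)</span>'s are in hand, the CV diagnostic is a one-liner. The two series below are made-up illustrations of a stable eye and a progressing eye, not estimates from real data.</p>

```r
# Coefficient of variation of alpha_t across visits: CV = sd(alpha) / mean(alpha)
alpha_stable      <- c(0.51, 0.49, 0.50, 0.52, 0.48)  # neighborhood structure steady
alpha_progressing <- c(0.50, 0.45, 0.60, 0.30, 0.85)  # structure changing over exams

cv <- function(x) sd(x) / mean(x)
cv(alpha_stable)        # small CV: little evidence of structural change
cv(alpha_progressing)   # large CV: suggests deterioration of the underlying structure
```

<p>A large CV reflects exam-to-exam instability in the estimated neighborhood structure, which is the behavior the diagnostic associates with progression.</p>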
</div>
<div id="navigating-the-womblr-package" class="section level2">
<h2>Navigating the <code>womblR</code> Package</h2>
<p>To make the method available to clinicians, the R package <code>womblR</code> was developed. The package provides a suite of functions that walk a user through the full process of analyzing a series of visual fields from beginning to end. The user interface was modeled after other impactful R packages for Bayesian spatial analysis, including <code>spBayes</code> and <code>CARBayes</code>. The package name combines Hadley Wickham’s naming convention for R packages (i.e., ending a package name with the letter R) with the name of the author of the seminal paper on boundary detection, a technique originally referred to as areal wombling (Womble 1951).</p>
<p>We will now walk through the process of analyzing visual field data, estimating the <span class="math inline">\(\alpha_t\)</span> parameters, and assessing progression status. The main function in <code>womblR</code> is the Spatiotemporal Boundary Detection with Dissimilarity Metric model function (<code>STBDwDM</code>). Inference for the method is obtained through Markov chain Monte Carlo (MCMC), which is a computationally intensive method that iterates between updating individual model parameters until enough posterior samples have been collected post-convergence for making accurate posterior inference. Because of the iterative nature of MCMC, the majority of computation is performed within a <code>for</code> loop, so the package is built on C++ through the packages <code>Rcpp</code> and <code>RcppArmadillo</code>. Because of the increased complexity of writing in C++, the pre- and post-processing of the model are done in <code>R</code> with the <code>for</code> loop implemented in C++. The MCMC method employed in <code>womblR</code> is a Metropolis-Hastings within Gibbs algorithm.</p>
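<p>As a toy illustration of a single Metropolis update (not the <code>STBDwDM</code> sampler itself, whose updates run in C++), here is a random-walk Metropolis sampler for the mean of a normal likelihood with a flat prior. In a Metropolis-within-Gibbs scheme, an update like this would be one step inside the larger loop over parameters. The data, proposal variance, and iteration counts are all arbitrary.</p>

```r
# Random-walk Metropolis for the mean of a normal(mu, 1) likelihood, flat prior
set.seed(42)
y <- rnorm(100, mean = 2, sd = 1)                 # simulated data
log_post <- function(mu) sum(dnorm(y, mu, 1, log = TRUE))

n_iter <- 5000
mu_draws <- numeric(n_iter)                        # chain starts at 0
for (s in 2:n_iter) {
  prop <- mu_draws[s - 1] + rnorm(1, sd = 0.5)     # proposal with tuning variance
  if (log(runif(1)) < log_post(prop) - log_post(mu_draws[s - 1])) {
    mu_draws[s] <- prop                            # accept the proposal
  } else {
    mu_draws[s] <- mu_draws[s - 1]                 # reject: keep current value
  }
}
posterior_mean <- mean(mu_draws[-(1:1000)])        # discard burn-in
```

<p>The tuning variance here plays the same role as the entries of the <code>Tuning</code> list passed to <code>STBDwDM</code>: it controls the proposal spread and hence the acceptance rate of each Metropolis step.</p>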
<p>Just as a quick aside, with the more recent advent of probabilistic programming, this model could have been implemented using the Hamiltonian Monte Carlo methods available in software like Stan or PyMC3. These programs do not require the derivation of full conditionals, and push the MCMC algorithm to the background. There is undoubtedly a huge market for this type of software, and it is clearly playing a significant role in the popularization of Bayesian modeling. At the same time, implementing MCMC samplers using <code>Rcpp</code> with traditional MCMC algorithms can be instructive, and for those with experience, nearly as quick a coding experience.</p>
<p>We now begin by formatting the visual field data for analysis. According to the manual, the observed data <code>Y</code> must first be ordered spatially and then temporally. Furthermore, we will remove all locations that correspond to the natural blind spot (which, in the Humphrey Field Analyzer-II, correspond to locations 26 and 35).</p>
<pre class="r"><code>###Load package
library(womblR)
###Format data
blind_spot <- c(26, 35) # define blind spot
VFSeries <- VFSeries[order(VFSeries$Location), ] # sort by location
VFSeries <- VFSeries[order(VFSeries$Visit), ] # sort by visit
VFSeries <- VFSeries[!VFSeries$Location %in% blind_spot, ] # remove blind spot locations
Y <- VFSeries$DLS # define observed outcome data</code></pre>
<p>Now that we have assigned the observed outcomes to <code>Y</code>, we move onto the temporal variable <code>Time</code>. For visual field data, we define this to be the time from the baseline visit. We obtain the unique days from the baseline visit and scale them to be on the year scale.</p>
<pre class="r"><code>Time <- unique(VFSeries$Time) / 365 # years since baseline visit
print(Time)</code></pre>
<pre><code>## [1] 0.0000000 0.3452055 0.6520548 1.1123288 1.3808219 1.6109589 2.0712329
## [8] 2.3780822 2.5698630</code></pre>
<p>Next, we assign the adjacency matrix and dissimilarity metric (both discussed in Part I).</p>
<pre class="r"><code>W <- HFAII_Queen[-blind_spot, -blind_spot] # visual field adjacency matrix
DM <- GarwayHeath[-blind_spot] # Garway-Heath angles</code></pre>
<p>Now that we have specified the data objects <code>Y</code>, <code>DM</code>, <code>W</code>, and <code>Time</code>, we can customize the objects that characterize Bayesian MCMC methods: hyperparameters, starting values, Metropolis tuning values, and MCMC inputs. These objects are detailed in the <code>womblR</code> package <a href="https://cran.r-project.org/web/packages/womblR/vignettes/womblR-example.html">vignette</a>, so we will not go over their definitions here. We will only note that they are each <code>list</code> objects, similar to those used in the <code>spBayes</code> package. We begin by specifying the hyperparameters.</p>
<pre class="r"><code>###Bounds for temporal tuning parameter phi
TimeDist <- abs(outer(Time, Time, "-"))
TimeDistVec <- TimeDist[lower.tri(TimeDist)]
minDiff <- min(TimeDistVec)
maxDiff <- max(TimeDistVec)
PhiUpper <- -log(0.01) / minDiff # shortest diff goes down to 1%
PhiLower <- -log(0.95) / maxDiff # longest diff goes up to 95%
###Hyperparameter object
Hypers <- list(Delta = list(MuDelta = c(3, 0, 0), OmegaDelta = diag(c(1000, 1000, 1))),
T = list(Xi = 4, Psi = diag(3)),
Phi = list(APhi = PhiLower, BPhi = PhiUpper))</code></pre>
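<p>The comments on <code>PhiLower</code> and <code>PhiUpper</code> can be verified directly. Under an exponential temporal correlation <code>exp(-phi * d)</code>, these bounds allow the correlation to decay to 1% over the shortest gap between visits and to remain as high as 95% over the longest gap. A quick self-contained check, using the <code>Time</code> vector printed earlier:</p>
<pre class="r"><code>###Time in years, as printed above
Time <- c(0.0000000, 0.3452055, 0.6520548, 1.1123288, 1.3808219,
          1.6109589, 2.0712329, 2.3780822, 2.5698630)
TimeDist <- abs(outer(Time, Time, "-"))
TimeDistVec <- TimeDist[lower.tri(TimeDist)]
minDiff <- min(TimeDistVec)
maxDiff <- max(TimeDistVec)
PhiUpper <- -log(0.01) / minDiff
PhiLower <- -log(0.95) / maxDiff
exp(-PhiUpper * minDiff) # 0.01 by construction
exp(-PhiLower * maxDiff) # 0.95 by construction</code></pre>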
<p>Then we specify the starting values for the parameters, Metropolis tuning variances, and MCMC details.</p>
<pre class="r"><code>###Starting values
Starting <- list(Delta = c(3, 0, 0), T = diag(3), Phi = 0.5)
###Metropolis tuning variances
Nu <- length(Time) # calculate number of visits
Tuning <- list(Theta2 = rep(1, Nu), Theta3 = rep(1, Nu), Phi = 1)
###MCMC inputs
MCMC <- list(NBurn = 10000, NSims = 250000, NThin = 25, NPilot = 20)</code></pre>
<p>We specify that our model will run for a burn-in period of 10,000 scans, followed by 250,000 scans post burn-in. In the burn-in period there will be 20 iterations of pilot adaptation evenly spaced out over the period. The final number of samples to be used for inference will be thinned down to 10,000 based on the thinning number of 25. We can now run the MCMC sampler. Details of the various options available in the sampler can be found in the documentation, <code>help(STBDwDM)</code>.</p>
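<p>The arithmetic linking these inputs to the number of retained samples is simple enough to check in one line (a quick illustration, not part of the package API):</p>
<pre class="r"><code>MCMC <- list(NBurn = 10000, NSims = 250000, NThin = 25, NPilot = 20)
NKeep <- MCMC$NSims / MCMC$NThin
NKeep # 10000 samples retained for inference</code></pre>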
<pre class="r"><code>reg.STBDwDM <- STBDwDM(Y = Y, DM = DM, W = W, Time = Time,
Starting = Starting, Hypers = Hypers, Tuning = Tuning, MCMC = MCMC,
Family = "tobit",
TemporalStructure = "exponential",
Distance = "circumference",
Weights = "continuous",
Rho = 0.99,
ScaleY = 10,
ScaleDM = 100,
Seed = 54)
## Burn-in progress: |*************************************************|
## Sampler progress: 0%.. 10%.. 20%.. 30%.. 40%.. 50%.. 60%.. 70%.. 80%.. 90%.. 100%.. </code></pre>
<p>We quickly assess convergence by checking the traceplots of <span class="math inline">\(\alpha_t\)</span> (note that further MCMC convergence diagnostics should be used in practice).</p>
<pre class="r"><code>###Load coda package
library(coda)
###Convert alpha to an MCMC object
Alpha <- as.mcmc(reg.STBDwDM$alpha)
###Create traceplot
par(mfrow = c(3, 3))
for (t in 1:Nu) traceplot(Alpha[, t], ylab = bquote(alpha[.(t)]), main = bquote(paste("Posterior of " ~ alpha[.(t)])))</code></pre>
<p><img src="/post/2018-12-03-statistics-in-glaucoma-part-ii_files/figure-html/unnamed-chunk-8-1.png" width="689.28" /></p>
</div>
<div id="converting-mcmc-samples-into-clinical-statements" class="section level2">
<h2>Converting MCMC Samples into Clinical Statements</h2>
<p>Now we calculate the posterior distribution of the CV of <span class="math inline">\(\alpha_t\)</span> and print its moments.</p>
<pre class="r"><code>CVAlpha <- apply(Alpha, 1, function(x) sd(x) / mean(x))
plot(density(CVAlpha, adjust = 2), main = expression("Posterior of CV"~(alpha[t])), xlab = expression("CV"~(alpha[t])))</code></pre>
<p><img src="/post/2018-12-03-statistics-in-glaucoma-part-ii_files/figure-html/unnamed-chunk-9-1.png" width="50%" style="display: block; margin: auto;" /></p>
<pre class="r"><code>STCV <- c(mean(CVAlpha), sd(CVAlpha), quantile(CVAlpha, probs = c(0.025, 0.975)))
names(STCV)[1:2] <- c("Mean", "SD")
print(STCV)</code></pre>
<pre><code>## Mean SD 2.5% 97.5%
## 0.19121622 0.10205826 0.04636219 0.42744656</code></pre>
<p>For this information to be useful clinically, we convert it into a probability of progression based on a model trained on a large cohort of glaucoma patients (Berchuck et al. 2019). Because the information from <span class="math inline">\(\alpha_t\)</span> is independent of trend-based methods, the optimal use of <span class="math inline">\(\alpha_t\)</span> is to combine it with a basic global metric that includes the slope and p-value (and their interaction) of the overall mean at each visual field exam. The trained model coefficients are publicly available and are used below. In addition, the mean and standard deviation of the CV of <span class="math inline">\(\alpha_t\)</span>, along with their interaction, are included. The probability of progression can be calculated as follows.</p>
<pre class="r"><code>###Calculate the global metric slope and p-value
MeanSens <- apply(t(matrix(VFSeries$DLS, ncol = Nu)) / 10, 1, mean) # scaled mean DLS
reg.global <- lm(MeanSens ~ Time) # global regression
GlobalS <- summary(reg.global)$coef[2, 1] # global slope
GlobalP <- summary(reg.global)$coef[2, 4] # global p-value
###Obtain probability of progression using estimated parameters from Berchuck et al. 2019
input <- c(1, GlobalP, GlobalS, STCV[1], STCV[2], GlobalS * GlobalP, STCV[1] * STCV[2])
coef <- c(-1.7471655, -0.2502131, -13.7317622, 7.4746348, -8.9152523, 18.6964153, -13.3706058)
fit <- input %*% coef
exp(fit) / (1 + exp(fit))</code></pre>
<pre><code>## [,1]
## [1,] 0.4355997</code></pre>
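<p>As an aside, the final inverse-logit step in the chunk above can equivalently be written with base R’s <code>plogis()</code>; the linear predictor below is a hypothetical value for illustration, not one from the fitted model.</p>
<pre class="r"><code>fit <- 0.25 # hypothetical linear predictor
manual <- exp(fit) / (1 + exp(fit))
all.equal(manual, plogis(fit)) # TRUE: identical up to numerical tolerance</code></pre>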
<p>The probability of progression is calculated to be 0.44, which can be compared to the threshold cutoff for the trained model of 0.325. This cutoff for the probability of progression was determined using operating characteristics, so that the specificity was forced to be in the clinically meaningful range of 85%. Based on this derived threshold, the probability of progression is high enough to indicate that this patient’s disease shows evidence of visual field progression (which is reassuring, because we know this patient has progression as determined by clinicians).</p>
<p><strong>Looking ahead:</strong> The third installment will wrap up the discussion of the <code>womblR</code> package and consider future directions for the role of statistics in glaucoma research. Furthermore, the role of open-source software in medicine will be discussed.</p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<ol style="list-style-type: decimal">
<li>Berchuck, S.I., Mwanza, J.C., & Warren, J.L. (2018). <a href="https://arxiv.org/abs/1805.11636"><em>Diagnosing Glaucoma Progression with Visual Field Data Using a Spatiotemporal Boundary Detection Method</em></a>, In press at <em>Journal of the American Statistical Association</em>.</li>
<li>Womble, W. H. (1951). <a href="http://science.sciencemag.org/content/114/2961/315"><em>Differential Systematics</em></a>. <em>Science</em>, 114(2961), 315-322.</li>
<li>Berchuck, S.I., Mwanza, J.C., Tanna, A.P., Budenz, D.L., & Warren, J.L. (2019). <em>Improved Detection of Visual Field Progression Using a Spatiotemporal Boundary Detection Method</em>. In press at <em>Scientific Reports</em> (available upon request).</li>
</ol>
</div>
<script>window.location.href='https://rviews.rstudio.com/2018/12/07/statistics-in-glaucoma-part-ii/';</script>
Statistics in Glaucoma: Part I
https://rviews.rstudio.com/2018/12/03/statistics-in-glaucoma-part-i/
Mon, 03 Dec 2018 00:00:00 +0000https://rviews.rstudio.com/2018/12/03/statistics-in-glaucoma-part-i/
<p><em>Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and Forge-Duke’s Center for Actionable Health Data Science.</em></p>
<p><em>Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.</em></p>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>Glaucoma is a leading cause of blindness worldwide, with a prevalence of 4% in the population aged 40-80. The disease is characterized by retinal ganglion cell death and corresponding damage to the optic nerve head. Since visual impairment caused by glaucoma is irreversible and effective treatments exist, early detection of the disease is essential. Determining whether the disease is progressing remains one of the most challenging aspects of glaucoma management, since it is difficult to distinguish true progression from variability due to natural degradation or noise. In practice, clinicians monitor progression using a multifactorial approach that relies on various measurements of the disease. In this series of blog posts, we focus on the use of visual fields. Visual field examinations measure a patient’s actual vision, and the practice is thus referred to as a functional measurement. As such, visual fields are a proxy for a patient’s quality of life, and are therefore typically prioritized in practice.</p>
</div>
<div id="visual-field-data" class="section level2">
<h2>Visual Field Data</h2>
<p>Visual fields are complex spatiotemporal data generated from an intricate anatomical system, which is important to understand for modeling purposes. To illustrate visual field data, we load an example data set from the <code>womblR</code> package on CRAN. The package <code>womblR</code> was developed specifically for analyzing visual field data, and uses a Bayesian hierarchical model that accounts for the complex nature of the data (more details will be provided in Part II). The specific data set comes from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry, Perth, Western Australia. We begin by loading the package.</p>
<pre class="r"><code>library(womblR)</code></pre>
<p>The data set of interest is loaded lazily and can be accessed as follows; we also view the first six rows for illustration.</p>
<pre class="r"><code>data(VFSeries)
head(VFSeries)</code></pre>
<pre><code>##   Visit DLS Time Location
## 1     1  25    0        1
## 2     2  23  126        1
## 3     3  23  238        1
## 4     4  23  406        1
## 5     5  24  504        1
## 6     6  21  588        1</code></pre>
<p>The data object <code>VFSeries</code> contains a longitudinal series of visual fields for a glaucoma patient that we will use throughout the three blog posts to exemplify the study of visual fields. This patient has been determined to be progressing, based on the expertise of two clinicians. <code>VFSeries</code> has four variables: <code>Visit</code>, <code>DLS</code>, <code>Time</code>, and <code>Location</code>. The variable <code>Visit</code> represents the visual field test visit number, <code>DLS</code> the observed measure, <code>Time</code> the time of the visual field test (in days from baseline visit), and <code>Location</code> the spatial location on the visual field where the observation occurred. There are 9 visual field exams contained in this data set, and on average 117.25 days between visits.</p>
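<p>The summary statistics quoted above can be reproduced from the visit times alone; the later visit days below are filled in from the full series (only the first six appear in the printout).</p>
<pre class="r"><code>visit_days <- c(0, 126, 238, 406, 504, 588, 756, 868, 938)
length(visit_days) # 9 visual field exams
mean(diff(visit_days)) # 117.25 days between visits on average</code></pre>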
<p>To help visualize the data frame, we can use the <code>PlotVfTimeSeries</code> function, which plots a patient’s observed visual field sensitivity over time at each location on the visual field.</p>
<pre class="r"><code>PlotVfTimeSeries(Y = VFSeries$DLS,
Location = VFSeries$Location,
Time = VFSeries$Time,
main = "Visual field sensitivity time series \n at each location",
xlab = "Days from baseline visit",
ylab = "Differential light sensitivity (dB)",
line.reg = FALSE)</code></pre>
<p><img src="/post/2018-11-19-statistics-in-glaucoma-part-i_files/figure-html/unnamed-chunk-3-1.png" width="528" style="display: block; margin: auto;" /></p>
<p>The above figure shows the visual field from a Humphrey Field Analyzer-II (HFA-II) testing machine, which tests 54 spatial locations (only 52 of them informative; note the two blank spots corresponding to the blind spot). The visual field map is constructed by assessing a patient’s response to varying levels of light. Patients are instructed to focus on a central fixation point while light stimuli are presented in a random sequence over a grid on the visual field. When a light is perceived, the patient presses a button and the current light intensity is recorded. The process is repeated until the entire visual field has been tested. The light intensity is measured in differential light sensitivity (DLS), which quantifies the difference between the HFA-II background and the observed light intensity. Smaller values indicate worsening vision.</p>
</div>
<div id="spatial-anatomy-on-the-visual-field" class="section level2">
<h2>Spatial Anatomy on the Visual Field</h2>
<p>The spatial surface of the visual field is observed on a lattice (i.e., uniform areal data); however, it is a complex projection of the underlying optic nerve head and exhibits anatomically induced spatial dependencies. In particular, localized damage to the optic disc can result in clinically predictable patterns of deterioration across the visual field. Incorporating this non-standard spatial dependence structure into our methodology is a priority for properly analyzing these data, although it is commonly ignored. In statistical terms, this means that a naive spatial model of the visual field surface, with neighbors defined simply as adjacent locations, would be inappropriate. Instead, the definition of a neighbor when considering vision loss on the visual field must depend on the underlying anatomical proximities.</p>
<p>To illustrate this concept, we begin by displaying the visual field neighborhood structure. The adjacency matrix for the HFA-II is available in the <code>womblR</code> package. In this analysis, we use a queen specification, meaning that an adjacency is defined as any location that shares an edge or corner on the lattice. We now load this adjacency matrix and remove the two locations that correspond to the blind spot.</p>
<pre class="r"><code>blind_spot <- c(26, 35) # define blind spot
W <- HFAII_Queen[-blind_spot, -blind_spot] # HFA-II visual field adjacency matrix</code></pre>
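<p>To make the queen specification concrete, here is a small hypothetical sketch (not part of <code>womblR</code>) that builds a queen adjacency matrix for a rectangular grid: two cells are neighbors when their row and column indices each differ by at most one.</p>
<pre class="r"><code>queen_adjacency <- function(nr, nc) {
  coords <- expand.grid(row = seq_len(nr), col = seq_len(nc))
  n <- nrow(coords)
  W <- matrix(0L, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    if (i != j &&
        abs(coords$row[i] - coords$row[j]) <= 1 &&
        abs(coords$col[i] - coords$col[j]) <= 1) W[i, j] <- 1L
  }
  W
}
W_small <- queen_adjacency(3, 3)
rowSums(W_small) # corners have 3 neighbors, edge midpoints 5, the center 8</code></pre>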
<p>This adjacency structure can be displayed using the <code>graph.adjacency</code> function in the <code>igraph</code> package.</p>
<pre class="r"><code>library(igraph)
adj.graph <- graph.adjacency(W, mode = "undirected")
plot(adj.graph)</code></pre>
<p><img src="/post/2018-11-19-statistics-in-glaucoma-part-i_files/figure-html/unnamed-chunk-5-1.png" width="528" style="display: block; margin: auto;" /></p>
<p>As mentioned above, naively assuming that all of these adjacencies are equal ignores the important underlying anatomy that drives these dependencies. The anatomical relationship between the visual field test points and the underlying optic nerve head was studied by Garway-Heath et al. (2000), who estimated the angle at which each test location’s underlying retinal ganglion cells enter the optic disc, measured in degrees. These angles are the missing link that allows the visual field adjacency structure to be dictated by the underlying anatomy. They can be visualized using the function <code>PlotAdjacency</code> from <code>womblR</code>, which displays neighborhood structures across the visual field. Before using this function, we need to load the angles measured in Garway-Heath et al. (2000). These are available from <code>womblR</code>; again, we remove the blind spot before use.</p>
<pre class="r"><code>Angles <- GarwayHeath[-blind_spot] # Garway-Heath angles
summary(Angles)</code></pre>
<pre><code>##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   80.75  192.50  177.35  275.75  329.00</code></pre>
<p>We are now ready to visualize the neighborhood structure of the visual field using the <code>PlotAdjacency</code> function.</p>
<pre class="r"><code>###Plot the angles on the visual field
PlotAdjacency(W = W,
DM = Angles,
zlim = c(0, 180),
Visit = NA,
edgewidth = 3.75,
cornerwidth = 0.33,
lwd.border = 3.75,
main = "Garway-Heath angles\n across the visual field")</code></pre>
<p><img src="/post/2018-11-19-statistics-in-glaucoma-part-i_files/figure-html/unnamed-chunk-7-1.png" width="528" style="display: block; margin: auto;" /></p>
<p>The angles measured by Garway-Heath et al. are presented at each location on the visual field. More interestingly, the distances between these angles are presented for each of the neighbor pairs. This figure is equivalent to the adjacency plot displayed above, but allows the adjacencies to vary as a function of the anatomy. In particular, if two visual field locations are anatomically similar, the dependency is strengthened (i.e., more white), and if the locations are close to anatomically independent, the dependency is weakened (i.e., more black). Here the edge adjacencies are represented by lines, while the diagonal adjacencies are represented as two triangles. This view of the visual field highlights the importance of anatomy in modeling visual field data, as neighboring locations can have underlying retinal ganglion cells that enter the optic nerve head with a large degree of separation. In particular, locations on either side of the equator, although adjacent on the lattice, are close to anatomically independent.</p>
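<p>The notion of anatomical dissimilarity can be sketched with a simple angular distance. The exact metric is handled internally by <code>womblR</code>; the helper below is only an illustrative stand-in that wraps differences around the full 360-degree circle.</p>
<pre class="r"><code>###Hypothetical angular distance (degrees) between two Garway-Heath angles
angle_dist <- function(a, b) {
  d <- abs(a - b) %% 360
  pmin(d, 360 - d)
}
angle_dist(11, 329) # 42: wraps around the circle, anatomically close
angle_dist(80, 275) # 165: nearly opposite, close to anatomically independent</code></pre>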
</div>
<div id="how-to-model-visual-field-data" class="section level2">
<h2>How to Model Visual Field Data?</h2>
<p>If you have gotten this far in the post, hopefully you have the sense that the study of visual field data is statistically interesting and clinically important for properly assessing a glaucoma patient’s risk of vision loss. In the next two blog posts, we will explore how visual field data are currently analyzed and new methods that account for the anatomical structure detailed above. To accomplish this, we will break down the algorithm and software used to build the <code>womblR</code> package, and will attempt to illustrate the importance of R packages for open-source clinical research.</p>
</div>
<div id="reference" class="section level2">
<h2>Reference</h2>
<ol style="list-style-type: decimal">
<li>Garway-Heath, David F., Darmalingum Poinoosawmy, Frederick W. Fitzke, and Roger A. Hitchings. “Mapping the visual field to the optic disc in normal tension glaucoma eyes.” <em>Ophthalmology</em> 107, no. 10 (2000): 1809-1815.</li>
</ol>
</div>
<script>window.location.href='https://rviews.rstudio.com/2018/12/03/statistics-in-glaucoma-part-i/';</script>