Survival Analysis on R Views

Multistate Models for Medical Applications

Wed, 19 Apr 2023 00:00:00 +0000

Clinical research studies and healthcare economics studies are frequently concerned with assessing the prognosis for survival in circumstances where patients suffer from a disease that progresses from state to state. Standard survival models only directly model two states: alive and dead. Multi-state models enable directly modeling disease progression where patients are observed to be in various states of health or disease at random intervals, but for which, except for death, the times of entering or leaving states are unknown. Multi-state models easily accommodate interval censored intermediate states while making the usual assumption that death times are known but may be right censored.

A natural way to conceptualize modeling the dynamics of disease progression with interval censored states is as continuous time Markov chains. The following diagram illustrates a possible disease progression model where there is some possibility of dying from any state, but otherwise a patient would progress from being healthy, to mild disease, to severe disease and then perhaps death.

## Loading required package: shape

It is true that in that the Markov assumption implies that the time patients spend in the various states are exponentially distributed. However, the mathematical theory of stochastic multi-state processes is very rich and can accommodate more realistic models with state dependent hazard rates that vary over time and other relaxations of the Markov assumption. Moreover, there is robust software in R (and other languages) that make multi-state stochastic survival models practical.

In the remainder of this post, I present a variation of a disease progression model discussed by Ardo van den Hout in some detail in his incredibly informative and very readable monograph Multi-State Survival Models for Interval Censored Data . Also note that van den Hout’s model is itself an elaboration of the main example presented by Christopher Jackson in the vignette to his msmpackage. This post presents a slower development of the model developed by Jackson and van den Hout that might be easier for a person not already familiar with the msm package to follow.

library(tidyverse)
library(tidymodels)
library(msm)

The Data

The data set explored by both Jackson and van den Hout is the Cardiac Allograft Vasculopathy (CAV) data set which contains the individual histories of angiographic examinations of 622 heart transplant recipients collected at the Papworth Hospital in the United Kingdom. This data is included in the msm package and is a good candidate to be the iris dataset for progressive disease models. It is a rich data set with 2846 rows and multiple covariates, including patient age and time time since transplant, both of which can be use for time scales, multiple state transitions among four states and no missing values. Observations of intermediate states are interval censored and have been recorded varying time intervals. Deaths are “exact” or right censored.

The following code creates a new variable that preserves the original state data for each observation and displays the data in tibble format.

set.seed(1234)
df <- tibble(cav) %>% mutate(o_state = state)

df

## # A tibble: 2,846 × 11
##     PTNUM   age years  dage   sex pdiag cumrej state firstobs statemax o_state
##     <int> <dbl> <dbl> <int> <int> <fct>  <int> <int>    <int>    <dbl>   <int>
##  1 100002  52.5  0       21     0 IHD        0     1        1        1       1
##  2 100002  53.5  1.00    21     0 IHD        2     1        0        1       1
##  3 100002  54.5  2.00    21     0 IHD        2     2        0        2       2
##  4 100002  55.6  3.09    21     0 IHD        2     2        0        2       2
##  5 100002  56.5  4       21     0 IHD        3     2        0        2       2
##  6 100002  57.5  5.00    21     0 IHD        3     3        0        3       3
##  7 100002  58.4  5.85    21     0 IHD        3     4        0        4       4
##  8 100003  29.5  0       17     0 IHD        0     1        1        1       1
##  9 100003  30.7  1.19    17     0 IHD        1     1        0        1       1
## 10 100003  31.5  2.01    17     0 IHD        1     3        0        3       3
## # ℹ 2,836 more rows

The state table which presents the number of times each pair of states were observed in successive observation times shows that 46 transitions from state 2 (Mild CAV) to state 1 (No CAV), 4 transitions from state 3 (Severe CAV) to Healthy and 13 transitions from Severe CAV to Mild CAV.

statetable.msm(state = state, subject = PTNUM, data = df)

##     to
## from    1    2    3    4
##    1 1367  204   44  148
##    2   46  134   54   48
##    3    4   13  107   55

I will follow van den Hout and assume these backward transitions are misclassified and alter the state variable so there is no back sliding. The following code does this in a tidy way and also creates a new variable b_age which records the baseline age at which patients entered the study. (Note: you can find van den Hout’s code here)

df1 <- df %>% group_by(PTNUM) %>% 
                     mutate(b_age = min(age),
                            state = cummax(state)
                     )

This transformation will make the state transition table conform to the diagram above, but with state 1 representing No CAV rather than Health.

statetable.msm(state = state, subject = PTNUM, data = df1)

##     to
## from    1    2    3    4
##    1 1336  185   40  139
##    2    0  220   52   49
##    3    0    0  140   63

Setting Up and Running the Model

The next step is to set up the model using the function msm() whose great flexibility means that some care must be taken to set parameter values.

First, we set up the initial guess for the intensity matrix, Q, which determines the transition rates among states for a continuous time Markov chain. For the msm function, positive values indicate possible transitions.

# Intensity matrix Q:
q <- 0.01
Q <- rbind(c(0,q,0,q), c(0,0,q,q),c(0,0,0,q),c(0,0,0,0))
qnames <- c("q12","q14","q23","q24","q34")

Next, we set up the covariate structure which van den Hout discusses in his monograph, but does not show in the code on the book’s website referenced above. For this model, transitions from state 1 to state 2 and from state 1 to state 4 depend on time,years, the age of the patient at transplant time b_age, and dage, the age of the donor. The other transitions depend only on dage. So, we see that msm() can deal with time varying covariates as well as permitting individual state transitions to be driven by different covariates.

covariates = list("1-2" = ~ years + b_age + dage , 
                  "1-4" = ~ years + b_age + dage ,
                  "2-3" = ~ dage,
                  "2-4" = ~ dage,
                  "3-4" = ~ dage)

Now, we set the remaining parameters for the model.

obstype <- 1
center <- FALSE
deathexact <- TRUE
method <- "BFGS"
control <- list(trace = 0, REPORT = 1)

obstype = 1 indicates that observations have been taken at arbitrary time points. They are snapshots of the process that are common for panel data.

center = FALSE means that covariates will not be centered at their means during the maximum likelihood estimation process. The default for this parameter is TRUE.

deathexact = TRUE indicates that the final absorbing state is exactly observed. This is the defining assumption survival data. In msm this is equivalent to setting obstupe = 3 for state 4, our absorbing state.

** method = BFGS** signals optim() to use the optimization method published simultaneously in 1970 by Broyden, Fletcher, Goldfarb and Shanno. (look here). This is a quasi-Newton method that uses function values and gradients to build up a picture of the surface to be optimized.

control = list(trace=0,REPORT=1) indicates more parameters that will be passed to optim().

REPORT sets the the frequency of reports for the “BFGS”, “L-BFGS-B” and “SANN” methods if control$trace is positive. Defaults to every 10 iterations for “BFGS” and “L-BFGS-B”, or every 100 temperatures for “SANN”. (Note: SANN is a variant of the simulated annealing method presented by C. J. P. Belisle (1992) Convergence theorems for a class of simulated annealing algorithms on R^d Journal of Applied Probability.)

trace is also passed tooptim(). trace must be a non-negative integer. If positive, tracing information on the progress of the optimization is produced. Higher values may produce more tracing information.

model_1 <- msm(state~years, subject = PTNUM, data = df1, center= center, 
             qmatrix=Q, obstype = obstype, deathexact = deathexact, method = method,
             covariates = covariates, control = control)

First check to see if the model has converged. For the BFGS method, possible convergence codes returned by optim() are: 0 indicates convergence, 1 indicates that the maximum iteration limit has been reached, 51 and 52 indicate warnings.

#Model Status
conv <- model_1$opt$convergence; cat("Convergence code =", conv,"\n")

## Convergence code = 0

Next, look at a measure of how well the model fits the data proposed by using a visual test proposed by Gentleman et al. (1994) which plots the observed numbers of individuals occupying a state at a series of times against forecasts from the fitted model, for each state. The msm function plot.prevalence.msm() produces a perfectly adequate base R plot. However, to emphasize that msm users are not limited to base R plots, I’ll do a little extra work to use ggplot(). When a package author is kind enough to provide an extractor function you can do anything you want with the data.

The prevalence.msm() function extracts both the observed and forecast prevalence matrices.

prev <- prevalence.msm(model_1)

This not very elegant, but straightforward code reshapes the data and plots.

# reshape observed prevalence
do1 <-as_tibble(row.names(prev$Observed)) %>% rename(time = value) %>% 
          mutate(time = as.numeric(time))
do2 <-as_tibble(prev$Observed) %>% mutate(type = "observed")
do <- cbind(do1,do2) %>% select(-Total)
do_l <- do %>% gather(state, number, -time, -type)
# reshape expected prevalence
de1 <-as_tibble(row.names(prev$Expected)) %>% rename(time = value) %>% 
          mutate(time = as.numeric(time))
de2 <-as_tibble(prev$Expected) %>% mutate(type = "expected")
de <- cbind(de1,de2) %>% select(-Total) 
de_l <- de %>% gather(state, number, -time, -type) 

# bind into a single data frame
prev_l <-rbind(do_l,de_l) %>% mutate(type = factor(type),
                                     state = factor(state),
                                     time = round(time,3))


# plot for comparison
prev_gp <- prev_l %>% group_by(state)
pp <- prev_l %>% ggplot() +
     geom_line(aes(time, number, color = type)) +
     xlab("time") +
     ylab("") +
     ggtitle("")
pp + facet_wrap(~state)

The agreement of the observed and forecast prevalence for states 1 through 3 look pretty good. After about 8 years the observed deaths are notably higher than the forecast. As Jackson points out (See the msm Manual page 33), this kind of discrepancy could indicate that the underlying process is not homogeneous. I have attempted to capture this non-homogeneity by having some of the transitions depend on time. And, although the plot above looks a little better that than the plot in the manual, which does not attempt to model non-homogeneity, it is apparent that there is room to find a better model!

Survival Curves and Calculations

Now, we can jump straight to the major result and look at the fitted survival curves. There is a plot() method for msm that will directly plot these curves. However, just to emphasize that if a package author is kind enough to provide a plot method, it will probably not be too difficult to hack the code for the method to use an alternative plotting system. To save space, I will not show my code, but you can easily recreate it by stating with the plot.msm() function, deleting the plotting parts and returning the values for time and the states “Health, Mild_CAV, and Severe_CAV which are used int the code below. Check my hack by running plot(model_1)

# plot_prep was obtained from plot.msm()
res <- plot_prep(model_1)
time <- res[[1]]
Health <- res[[2]]
Mild_CAV <- res[[3]]
Severe_CAV <- res[[4]]
df_w <- tibble(time,Health, Mild_CAV, Severe_CAV)
df_l <- df_w %>% gather("state", "survival", -time)
p <- df_l %>% ggplot(aes(time, 1 - survival, group = state)) +
     geom_line(aes(color = state)) +
     xlab("Years") +
     ylab("Probability") +
     ggtitle("Fitted Survival Probabilities")
p

These curves indicate that a treatment that could prevent CAV or at least delay progression from mild CAV to severe CAV might prolong survival. Additionally, the Markov structure of the model permits extracting information that relates to disease progression and the total time spent in each state.

The function totlos.msm() estimates the total expected time that a patient will spend in each state. Parameter settings for this function include:

start = c(1,0,0,0) specifies that patients will start in state 1 with probability 0.

fromt = 0 indicates starting at the beginning of the process.

covariates = “mean” indicates that the covariates will be set to their mean values for the calculation.

total_state_time <-totlos.msm(model_1,start = c(1,0,0,0), from = 0, covariates = "mean")
total_state_time

## State 1 State 2 State 3 State 4 
##   7.002   2.473   1.621     Inf

The table indicates that the mean time a patient can expect to avoid CAV is about 7 years. After progressing to a Mild_CAV, a patient can expect five additional years.

A more direct calculation based on the intensity matrix, Q, give the expected time to the “absorbing” state, Death, from each of the “transient” living states.

time_to_death <- efpt.msm(model_1, tostate = 4, covariates = "mean")
time_to_death

## [1] 11.097  5.836  3.005  0.000

This agrees with the total state times calculated above.

sum(total_state_time[1:3])

## [1] 11.1

Within the scope of the information provided by the covariates, it is also possible to generate more individualized forecasts. For example, here is the expected time to death for a person starting off with no CAV at age 60, who received a heart from a 20 year old donor, 5 years after the transplant.

efpt.msm(model_1, tostate = 4, start = c(1,0,0,0), covariates = list(years = 5, b_age = 60, dage = 20))

##       [,1]
## [1,] 7.953

A related quantity, mean sojourn time, is the mean time that each visit to each state is expected to last. Since, we are assuming a progressive disease model where each patient visits each state only once, the estimate should be close to total time spent in each state. However, Jackson notes that in a progressive model, sojourn time in the disease state will be greater than the expected length of stay in the disease state because the mean sojourn time in a state is conditional on entering the state, whereas the expected total time in a diseased state is a forecast for an individual, who may die before getting the disease. (See help(totlos.msm)). And indeed, that is what we see here for states 2 and 3.

sojourn.msm(model_1, covariates="mean")

##         estimates     SE     L     U
## State 1     7.002 0.4024 6.256 7.837
## State 2     3.525 0.3020 2.980 4.169
## State 3     3.005 0.3748 2.353 3.837

Hazard Ratios

model_1 will also product estimates of hazard ratios which show the estimate effect on transition intensities for each state.

Here is the table of Hazard Ratios:

model_1

## 
## Call:
## msm(formula = state ~ years, subject = PTNUM, data = df1, qmatrix = Q,     obstype = obstype, covariates = covariates, deathexact = deathexact,     center = center, method = method, control = control)
## 
## Maximum likelihood estimates
## Baselines are with covariates set to 0
## 
## Transition intensities with hazard ratios for each covariate
##                   Baseline                         years              
## State 1 - State 1 -0.032750 (-0.0607897,-0.017644)                    
## State 1 - State 2  0.030957 ( 0.0160470, 0.059721) 1.112 (1.061,1.166)
## State 1 - State 4  0.001793 ( 0.0004703, 0.006836) 1.093 (1.012,1.182)
## State 2 - State 2 -0.395633 (-0.6488723,-0.241227)                    
## State 2 - State 3  0.264310 ( 0.1488153, 0.469441) 1.000              
## State 2 - State 4  0.131323 ( 0.0330133, 0.522385) 1.000              
## State 3 - State 3 -0.434548 (-0.9113857,-0.207192)                    
## State 3 - State 4  0.434548 ( 0.2071918, 0.911386) 1.000              
##                   b_age                dage                 
## State 1 - State 1                                           
## State 1 - State 2 1.001 (0.9884,1.014) 1.0281 (1.0159,1.040)
## State 1 - State 4 1.053 (1.0271,1.079) 1.0208 (1.0039,1.038)
## State 2 - State 2                                           
## State 2 - State 3 1.000                0.9932 (0.9756,1.011)
## State 2 - State 4 1.000                0.9757 (0.9298,1.024)
## State 3 - State 3                                           
## State 3 - State 4 1.000                0.9906 (0.9672,1.015)
## 
## -2 * log-likelihood:  3466

The table shows that time, the covariate years, affects disease progression represented by the the transition from state 1 to state 2, but has a smaller effect on the transition from state 1 to state 4.

The covariate b_age, the baseline age of patient at transplant time has a larger effect on dying before the onset of CAV than on the transition to CAV.

The covariate dage has a minor effect on the transitions from state 1 but apparently has no effect thereafter.

The hazard ratios are computed by calculating the exponential of the estimated covariate effects on the log-transition intensities for the Markov process which are stored in the model object.

To see how these work, first look at the baseline hazard ratios.

model_1$Qmatrices$baseline

##          State 1  State 2 State 3  State 4
## State 1 -0.03275  0.03096  0.0000 0.001793
## State 2  0.00000 -0.39563  0.2643 0.131323
## State 3  0.00000  0.00000 -0.4345 0.434548
## State 4  0.00000  0.00000  0.0000 0.000000

These baseline hazard ratios are computed from the model intensity matrix, Q, assuming no covariates. They can also be directly extracted from the model by qmatrix.msm() extractor function with covariates set to zero.

qmatrix.msm(model_1,  covariates = 0)

##         State 1                          State 2                         
## State 1 -0.032750 (-0.0607897,-0.017644)  0.030957 ( 0.0160470, 0.059721)
## State 2 0                                -0.395633 (-0.6488723,-0.241227)
## State 3 0                                0                               
## State 4 0                                0                               
##         State 3                          State 4                         
## State 1 0                                 0.001793 ( 0.0004703, 0.006836)
## State 2  0.264310 ( 0.1488153, 0.469441)  0.131323 ( 0.0330133, 0.522385)
## State 3 -0.434548 (-0.9113857,-0.207192)  0.434548 ( 0.2071918, 0.911386)
## State 4 0                                0

The 95% confidence limits are computed by assuming normality of the log-effect.

A more representative value for the intensity matrix for this model can be obtained by setting the covariates to their expected mean values.

qmatrix.msm(model_1,  covariates = "mean")

##         State 1                      State 2                     
## State 1 -0.14281 (-0.15984,-0.12760)  0.10019 ( 0.08742, 0.11482)
## State 2 0                            -0.28369 (-0.33556,-0.23984)
## State 3 0                            0                           
## State 4 0                            0                           
##         State 3                      State 4                     
## State 1 0                             0.04262 ( 0.03339, 0.05441)
## State 2  0.21821 ( 0.17636, 0.26999)  0.06548 ( 0.03872, 0.11075)
## State 3 -0.33281 (-0.42498,-0.26064)  0.33281 ( 0.26064, 0.42498)
## State 4 0                            0

Next, we may want to examine the contribution of the covariate covariates to the hazard ratios. To take a particular example, look at the dage to the hazard ratios

model_1$Qmatrices$dage

##         State 1 State 2   State 3   State 4
## State 1       0 0.02771  0.000000  0.020575
## State 2       0 0.00000 -0.006783 -0.024625
## State 3       0 0.00000  0.000000 -0.009439
## State 4       0 0.00000  0.000000  0.000000

and focus on the contribution of dage to the intensity matrix for the transition from state 3 to state 4 which is given as -0.009439 in the table above. Taking the exponential of this value, yields the hazard ratio for the dage state 3 to 4 transition in the hazard ratio’s table we got by printing out model_1 above.

exp(-.009439)

## [1] 0.9906

The hazard ratio tables for the remaining covariates are given by:

model_1$Qmatrices$years

##         State 1 State 2 State 3 State 4
## State 1       0  0.1064       0 0.08933
## State 2       0  0.0000       0 0.00000
## State 3       0  0.0000       0 0.00000
## State 4       0  0.0000       0 0.00000

and

model_1$Qmatrices$b_age

##         State 1   State 2 State 3 State 4
## State 1       0 0.0009645       0 0.05152
## State 2       0 0.0000000       0 0.00000
## State 3       0 0.0000000       0 0.00000
## State 4       0 0.0000000       0 0.00000

Exploring Transition Probabilities and Intensities

It is also possible to look at the state transition matrix at different times and see how these probabilities change over time. Here we compute the transition matrix at 1 year.

pmatrix.msm(model_1, t = 1, covariates = "mean")

##         State 1 State 2  State 3 State 4
## State 1  0.8669 0.08101 0.008493 0.04357
## State 2  0.0000 0.75300 0.160340 0.08666
## State 3  0.0000 0.00000 0.716903 0.28310
## State 4  0.0000 0.00000 0.000000 1.00000

and at 5 years.

pmatrix.msm(model_1, t = 5, covariates = "mean")

##         State 1 State 2 State 3 State 4
## State 1  0.4897  0.1761 0.07871  0.2556
## State 2  0.0000  0.2421 0.23419  0.5237
## State 3  0.0000  0.0000 0.18937  0.8106
## State 4  0.0000  0.0000 0.00000  1.0000

Additionally, it is possible to examine the effect of covariates on transition probabilities. Here are the 5 year transition probabilities for a patient with a baseline age of 35.

pmatrix.msm(model_1, t = 5, covariates = list(years = 5, b_age = 35, dage = 20))

##         State 1 State 2 State 3 State 4
## State 1  0.5474  0.1674 0.07575  0.2095
## State 2  0.0000  0.2112 0.21622  0.5726
## State 3  0.0000  0.0000 0.16547  0.8345
## State 4  0.0000  0.0000 0.00000  1.0000

and those who had the procedure at age 60.

pmatrix.msm(model_1, t = 5, covariates = list(years = 5, b_age = 60, dage = 20))

##         State 1 State 2 State 3 State 4
## State 1  0.3863  0.1409 0.06784  0.4050
## State 2  0.0000  0.2112 0.21622  0.5726
## State 3  0.0000  0.0000 0.16547  0.8345
## State 4  0.0000  0.0000 0.00000  1.0000

Note that the transitions from CAV states are unaffected.

To summarize: Continuous Time Markov Chains provide a natural framework for working with multi-state survival models. The msm package is sufficiently sophisticated to permit modeling clinical process with level of fidelity that may provide insight about clinically observed disease progression. The software is relatively easy to use and there is plenty of documentation to help you get started.

Learning More About Multi-State Survival Models

To dive deeper into multi-state survival models, I am sure you will find Ardo van den Hout’ Multi-State Survival Models for Interval-Censored Data extraordinarily helpful. There are many good textbooks about the basics of Continuous Time Markov Chains. I recommend J.R.Norris’ - Markov Chains which is still modestly priced. There are also many expositions freely available on the internet including:

David F. Anderson - Chapter 6: Continuous Time Markov Chains from Lecture Notes on Stochastic Processes with Applications in Biology
Miranda Holmes-Cerfon - Lecture 4: Continuous-time Markov Chains
Søren Feodor Nielsen - Continuous-time homogeneous Markov chains
Karl Sigman - Continuous-Time Markov Chains

Beneath and Beyond the Cox Model

Tue, 06 Sep 2022 00:00:00 +0000

The Cox Proportional Hazards model has so dominated survival analysis over the past forty years that I imagine quite a few people who regularly analyze survival data might assume that the Cox model, along with the Kaplan-Meier estimator and a few standard parametric models, encompass just about everything there is to say about the subject. It would not be surprising if this were true because it is certainly the case that these tools have dominated the teaching of survival analysis. Very few introductory textbooks look beyond the Cox Model and the handful of parametric models built around Gompertz, Weibull and logistic functions. But why do Cox models work so well? What is the underlying theory? How do all the pieces of the standard survival tool kit fit together?

As it turns out, Kaplan-Meier estimators, the Cox Proportional Hazards model, Aalen-Johansen estimators, parametric models, multistate models, competing risk models, frailty models and almost every other survival analysis technique implemented in the vast array of R packages comprising the CRAN Survival Task View, are supported by an elegant mathematical theory that formulates time-to-event analyses as stochastic counting models. The theory is about thirty years old. It was initiated by Odd Aalen in his 1975 Berkeley PhD dissertation, developed over the following twenty years largely by Scandinavian statisticians and their collaborators, and set down in more or less complete form in two complementary textbooks [5 and 9] by 1993. Unfortunately, because of its dependency on measure theory, martingales, stochastic integrals and other notions from advanced mathematics, it does not appear that the counting process theory of survival analysis has filtered down in a form that is readily accessible by practitioners.

In this rest of this post, I would like to suggest a path for getting a working knowledge of this theory by introducing two very readable papers, which taken together, provide an excellent overview of the relationship of counting processes to some familiar aspects of survival analysis.

The first paper, The History of applications of martingales in survival analysis by Aalen, Andersen, Borgan, Gill, and Keiding [4] is a beautiful historical exposition of the counting process theory by master statisticians who developed a good bit of the theory themselves. Read through this paper in an hour of so and you will have an overview of the theory, see elementary explanations for some of the mathematics involved, and gain a working idea of how the major pieces of the theory fit together how they came together.

The second paper, Who needs the Cox model anyway? [7], is actually a teaching note put together by Bendix Carstensen. It is a lesson with an attitude and the R code to back it up. Carstensen demonstrates the equivalence of the Cox model to a particular Poisson regression model. Working through this note is like seeing a magic trick and then learning how it works.

The following reproduces a portion of Carstensen’s note. I provide some commentary and fill in a few elementary details in the hope that I can persuade you that it is worth the trouble to spend some time with it yourself.

Carstensen use the North Central Cancer Treatment Group lung cancer survival data set which is included in the survival package for his examples.

## # A tibble: 228 × 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows

It may not be obvious at first because there is no subject ID column, but this data frame contains one row for each of 228 subjects. The first column is an institution code. time is the time of death or censoring. status is the censoring indicator. The remaining columns are covariates. I select time, status, sex and age, drop the others from the our working data frame, and then replicate Carstensen’s preprocessing in a tidy way. The second line of mutate() adds a small number to each event time to avoid ties.

set.seed(1952)
lung <- lung %>% select(time, status, age, sex) %>% 
                  mutate(sex = factor(sex,labels=c("M","F")),
                         time = time + round(runif(nrow(lung),-3,3),2))

To get a feel for the data we fit a Kaplan-Meier Curve stratified by sex.

surv.obj <- with(lung, Surv(time, status == 2))
fit.by_sex <- survfit(surv.obj ~ sex, data = lung, conf.type = "log-log")
autoplot(fit.by_sex,
          xlab = "Survival Time (Days) ", 
          ylab = "Survival Probabilities",
         main = "Kaplan-Meier plot of lung data by sex") +  
 theme(plot.title = element_text(hjust = 0.5))

Next, following Carstensen, I fit the baseline Cox model to be used in the model comparisons below.

m0.cox <- coxph( Surv( time, status==2 ) ~ age + sex, data=lung )
summary( m0.cox )

## Call:
## coxph(formula = Surv(time, status == 2) ~ age + sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##          coef exp(coef) se(coef)     z Pr(>|z|)   
## age   0.01705   1.01720  0.00922  1.85   0.0643 . 
## sexF -0.52033   0.59433  0.16751 -3.11   0.0019 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## age      1.017      0.983     0.999     1.036
## sexF     0.594      1.683     0.428     0.825
## 
## Concordance= 0.603  (se = 0.025 )
## Likelihood ratio test= 14.4  on 2 df,   p=7e-04
## Wald test            = 13.8  on 2 df,   p=0.001
## Score (logrank) test = 14  on 2 df,   p=9e-04

The hazard ratios for age and sexF are given in the output column labeled exp(coef). As Carstensen points out mortality increases by 1.7% per year of age at diagnosis and that women have 40% lower mortality than men.

Carstensen next shows that this model can be exactly replicated by a particular and somewhat peculiar Poisson model. Doing this requires a shift in how time is conceived. In the Kaplan-Meier estimator and the Cox model, time is part of the response vector. In the counting process formulation, time is a covariate. Time is divided into many small intervals of length h in which an individuals “exit status” , d is recorded. d will be 1 if death occurred or 0 otherwise. The h intervals represent an individual’s risk time. The pair (d, h) are used to calculate an empirical rate for the process which corresponds to the theoretical rate:

λ(t) = lim_h→0 P[event in (t, t + h)| at risk at time t]/h (*)

The first step in formulating a Poisson model is to set up a data structure that will allow for this more nuanced treatment of time. The function Lexis from the Epi package creates an object of class Lexis, a data frame with columns that will be used to distinguish event time (death or censoring time) from the time intervals in which subjects are at risk for the event. Collectively, these intervals span period from when the first until the last recorded time.

Lung <- Epi::Lexis( exit = list( tfe=time ),
  exit.status = factor( status, labels=c("Alive","Dead") ),
  data = lung )

## NOTE: entry.status has been set to "Alive" for all.
## NOTE: entry is assumed to be 0 on the tfe timescale.

head(Lung)

##  lex.id tfe lex.dur lex.Cst lex.Xst   time status age sex
##       1   0   308.7   Alive    Dead  308.7      2  74   M
##       2   0   457.4   Alive    Dead  457.4      2  68   M
##       3   0  1008.6   Alive   Alive 1008.6      1  56   M
##       4   0   212.1   Alive    Dead  212.1      2  57   M
##       5   0   885.5   Alive    Dead  885.5      2  60   M
##       6   0  1023.7   Alive   Alive 1023.7      1  74   M

The new variables are:

lex.id - Subject ID
tfe - Time from entry at the beginning of the follow-up interval
lex.dur - Duration of the follow-up interval
lex.Cst - Entry status (Alive in our case)
lex.Xst - Exit status at the end of the follow-up interval: tfe + lex.dur (Either Alive or Dead in our case)

Next, the time variable is sorted to produce a vector of endpoints for the at risk intervals and a new time-split Lexis data frame is created using the splitMulti() function from the popEpi package.

Lung.s <- splitMulti( Lung, tfe=c(0,sort(unique(Lung$time))) )
head(Lung.s)

##  lex.id   tfe lex.dur lex.Cst lex.Xst  time status age sex
##       1  0.00    7.67   Alive   Alive 308.7      2  74   M
##       1  7.67    1.88   Alive   Alive 308.7      2  74   M
##       1  9.55    0.23   Alive   Alive 308.7      2  74   M
##       1  9.78    0.57   Alive   Alive 308.7      2  74   M
##       1 10.35    2.25   Alive   Alive 308.7      2  74   M
##       1 12.60    0.45   Alive   Alive 308.7      2  74   M

summary( Lung.s )

##        
## Transitions:
##      To
## From    Alive Dead  Records:  Events: Risk time:  Persons:
##   Alive 25941  165     26106      165      69632       228

tfe tracks the time from entry into the study. This is calendar time.

Carstensen then fits a Cox model to both the Lexis data set and the time-split Lexis data set and notes that the results match the original baseline Cox model. This is as one would expect since the three different data frames contain the same information. Nevertheless, it is a pleasant surprise that the coxph() and Surv() functions are flexible enough to assimilate the three different input data formats.

mL.cox <- coxph( Surv( tfe, tfe+lex.dur, lex.Xst=="Dead" ) ~ age + sex, eps=10^-11, iter.max=25, data=Lung )
mLs.cox <- coxph( Surv( tfe, tfe+lex.dur, lex.Xst=="Dead" ) ~ age + sex, eps=10^-11, iter.max=25, data=Lung.s )
round( cbind( ci.exp(m0.cox), ci.exp(mL.cox), ci.exp(mLs.cox) ), 6 )

##      exp(Est.)  2.5%  97.5% exp(Est.)  2.5%  97.5% exp(Est.)  2.5%  97.5%
## age     1.0172 0.999 1.0357    1.0172 0.999 1.0357    1.0172 0.999 1.0357
## sexF    0.5943 0.428 0.8253    0.5943 0.428 0.8253    0.5943 0.428 0.8253

Now, Carstensen executes what would seem to be a very strange modeling maneuver. He turns calender time, tfe into a factor and fits a Cox model with tfe as a covariate.

mLs.pois.fc <- glm( cbind(lex.Xst=="Dead",lex.dur) ~ 0 + factor(tfe) + age + sex, family=poisreg, data=Lung.s )

An important technical point is that the time intervals in equation (*) above do not satisfy the independence assumption for a Poisson regression model. Nevertheless, the standard glm() machinery can be used to fit the model because, as Carstensen demonstrates, the likelihood function for the conditional probabilities is proportional to the partial likelihood function of the Cox model.

cbind( ci.exp(mLs.cox),ci.exp( mLs.pois.fc, subset=c("age","sex") ) )

##      exp(Est.)  2.5%  97.5% exp(Est.)  2.5%  97.5%
## age     1.0172 0.999 1.0357    1.0172 0.999 1.0357
## sexF    0.5943 0.428 0.8253    0.5943 0.428 0.8253

Carstensen concludes that this demonstrates that the Cox model is equivalent to a specific Poisson model which has one rate parameter for each time internal, and emphasizes that this is not a new result. He notes that the equivalence was demonstrated some time ago, theoretically by Theodore Holford [10], and in practice, by John Whitehead [14]. Also, in a vignette [15] for the survival package Zhong et al. state that this trick may be used to approximate a Cox model.

Carstensen then demonstrates that more practical Poisson models can be fit by using splines to decrease the number of at risk intervals. The first uses a spline basis with arbitrary knot locations and the second fits a penalized spline gam model.

splines

t.kn <- c(0,25,100,500,1000) 
mLs.pois.sp <- glm( cbind(lex.Xst=="Dead",lex.dur) ~ Ns(tfe,knots=t.kn) + age + sex, family=poisreg, data=Lung.s )

Penalized splines

mLs.pois.ps <- mgcv::gam( cbind(lex.Xst=="Dead",lex.dur) ~ s(tfe) + age + sex, family=poisreg, data=Lung.s )

Carstensen finishes up this portion of his analysis by noting the similarity of the estimates of age and sex effects from the different models.

ests <-
 rbind( ci.exp(m0.cox),
 ci.exp(mLs.cox),
 ci.exp(mLs.pois.fc,subset=c("age","sex")),
 ci.exp(mLs.pois.sp,subset=c("age","sex")),
 ci.exp(mLs.pois.ps,subset=c("age","sex")) )

cmp <- cbind( ests[c(1,3,5,7,9) ,],
 ests[c(1,3,5,7,9)+1,] )

rownames( cmp ) <-
 c("Cox","Cox-split","Poisson-factor","Poisson-spline","Poisson-penSpl")

 colnames( cmp )[c(1,4)] <- c("age","sex")
round( cmp,5 )

##                  age   2.5% 97.5%    sex   2.5%  97.5%
## Cox            1.017 0.9990 1.036 0.5943 0.4280 0.8253
## Cox-split      1.017 0.9990 1.036 0.5943 0.4280 0.8253
## Poisson-factor 1.017 0.9990 1.036 0.5943 0.4280 0.8253
## Poisson-spline 1.016 0.9980 1.035 0.5993 0.4316 0.8322
## Poisson-penSpl 1.016 0.9983 1.035 0.6021 0.4338 0.8358

This demonstration provides some convincing evidence that both parametric and non-parametric models are part of a single underlying theory! When you think about it, this is an astonishing idea. To further explore the counting process theory of survival models, I provide a definition of Aalen’s multiplicative intensity model and a list of references below that I hope you will find helpful.

Finally, there is much more to Carstensen’s note than I have presented. He goes on to provide a fairly complete analysis of the lung data while looking at cumulative rates, survival, practical time splitting, time varying coefficients and more ideas along the way.

Appendix: Multiplicative Intensity Model

For some direct insight into how the Cox Proportional Hazards model fits into the counting process theory have a look at Odd Aalen’s definition of the multiplicative intensity model. Aalan begins his landmark 1978 paper Nonparametric Inference For a Family of Counting Processes] [1] by defining the fundamental components of his multiplicative intensity model.

Let N = (N₁, . . . N_k) be a multivariate counting process which is a collection of univariate counting processes on the interval [0,t] each of which counts events in [0,t]. The N_i may depend on each other. Let σ(F_t) be the sigma algebra which represents the collection of all events that can be determined to have happened by time, t. Let α = α₁, . . . α be an unknown, non-negative function and let Y = (Y₁, . . . Y_k) be a process observable over [0,t].

Define Λ_i(t) = lim_h→0E(N_i(t + h) - N_i(t) | F_t )/h i = 1, … k, to be the the intensity process of the counting process N.

Then the multiplicative intensity model is defined to be: Λ_i(t) = α_i(t)Y_i(t).

This last line certainly looks like the Cox model, and it is not to difficult to confirm that this is indeed the case. You can find the gory details in Fleming and Harrington [9 p 126] or comprehensive monograph by Andersen et al. [5 p481].

References

I believe that the following references (annotated with a few comments) comprise a reasonable basis from gaining familiarity with the counting process approach to survival modeling.

Aalen (1978) Odd O. Aalen Nonparametric Inference For a Family of Counting Processes. The Annals of Statistics 1978, vol 6, no 4, 701-726 This is the ‘Ur’ paper for the multiplicative intensity process. At least the first half should be approachable with some knowledge of measure theory and conditional expectation.
Aalen & Johansen (1978) Odd O. Allen and Soren Johansen. An Empirical Transition Matrix for Non-homogenous Markov Chains Based on Censored Observations. Scand J Statistics 5: 141-150, 1978. This is the source of the Aalen-Johansen estimator. The etm package provides an implementation.
Aalen et al. (2008) Odd O. Aalen, Ørnulf Borgan and Håkon K. Gjessing. Survival and Event History Analysis; A Process Point of View Springer Verlag 2008. This is definitely the text to read first. It is comprehensive, takes a modern point of view, is well written, and introduces the difficult mathematics without all of the technical details that often slow down the process of learning some new mathematics.
Aalen et al. (2009). Odd O. Aalen, Per Kragh Andersen, Ørnulf Borgan, Richard D. Gill and Niels Keiding. History of applications of martingales in survival analysis vol 5, no 1, June 2009
Andersen et al. (1993) Per Kragh Andersen, Ørnulf Borgan, Richard D. Gill, Niels Keiding. Statistical Models Based on Counting Processes. Springer-Verlag, 1993 This text presents numerous examples along with a discussion of the theory and emphasizes parametric models.
Borgan (1997). Ørnulf Borgan Three contributions to the Encyclopedia of Biostatistics: The Nelson-Aalen, Kaplan-Meier, and Aalen-Johansen estimators - Very clear summaries.
Carstensen (2019) Bendex Carstensen. Who needs the Cox model anyway?
Cox (1972) Regression Models and Life-Tables JRSS Vol. 34, No.2, 187-200
Fleming & Harrington (1991). Thomas R. Fleming and David P. Harrington. Counting Processes & Survival Analysis, John Wiley & Sons, Inc. 1991 This text book develops all of the math needed and goes on to study non-parametric models including the Cox model.
Holford. Life table with concomitant information. Biometrics, 32:587{597, 1976.
Mohammed (2019). Introduction to Survival Analysis using R
Threneau (2022). The survival package (vignette)
Therneau et al. (1990). Martingale based residuals for survival models. Biometrika 77, 147-160. Martingale residuals appear to be the one use case where martingales openly surface in the survival calculations.
Whitehead (1980). Fitting Cox’s regression model to survival data using GLIM. Applied Statistics, 29(3):268{275, 1980.
Zhong et al. (2019). Approximating a Cox Model. This is a survival package vignette.

Biologically Plausible Fake Survival Data

Mon, 02 Nov 2020 00:00:00 +0000

In two recent posts, one on the Disease Progression Model and the other on Fake Data, I highlighted some of R's tools for simulating data that exhibit desired correlations and other statistical properties. In this post, I’ll focus on a small cluster of R packages that support generating biologically plausible survival data.

Background

In an impressive paper Simulating biologically plausible complex survival data Crowther & Lambert (2013) that combines survival analysis theory and numerical methods, Michael Crowther and Paul Lambert address the problem of simulating plausible data in which event time, censuring and covariate distributions are plausible. They develop a methodology for conducting survival analysis studies, and also provide computational tools for moving beyond the usual exponential, Weibull and Gompertz models. Building on the work by Bender et al. (2005) in establishing a framework for simulating survival data for Cox proportional hazards models, Crowther and Lambert discuss how modelers can incorporate non proportional model hazards, time varying effects, delayed entry and random effects and provide code examples based on the Stata survsim package.

The `survsim` package

Not long after the Stata package appeared, Moriña and Navarro released the R survsim package which implements some of the features in the Stata package for simulating complex survival data. The R package does not have a vignette, but you can find several examples in the JSS paper Moriña & Navarro (2014).

The following example from section 4.3 of the paper simulates adverse events for a clinical trial with 100 patients followed up for 30 days. The authors suggest that the three covariates x could represent body mass index, age at entry to the cohort, and whether or not the subject has hypertension. This is a little bit unusual and sophisticated example of survival modeling.

set.seed(12345)
dist.ev <- c("weibull", "llogistic", "weibull")
anc.ev <- c(0.8, 0.9, 0.82)
beta0.ev <- c(3.56, 5.94, 5.78)
beta <- list(c(-0.04, -0.02, -0.01), c(-0.001, -0.0008, -0.0005),c(-0.7, -0.2, -0.1))
x <- list(c("normal", 26, 4.5), c("unif", 50, 75), c("bern", 0.25))
clinical.data <- mult.ev.sim(n = 100,      # number of patients in cohort
                            foltime = 30,  # maximal followup time
                            dist.ev,       # time to event distributions (t.e.d.)
                            anc.ev,        # parameters for t.d.e. distributions
                            beta0.ev,      # beta0 parameters for t.d.e. dist 
                            dist.cens = "weibull", #censoring distribution
                            anc.cens = 1,  # parameters for censoring dist
                            beta0.cens = 5.2, # beta0 for censoring dist
                            z = list(c("unif", 0.6, 1.4)), # random effect dist
                            beta, # effect of covariate
                            x, # distributions of covariates
                            nsit = 3) # max number of adverse events for an individual
head(round(clinical.data,2))

##   nid ev.num  time status start  stop    z     x   x.1 x.2
## 1   1      1  5.79      1     0  5.79 0.97 28.63 69.02   1
## 2   1      2 30.00      0     0 30.00 0.97 28.63 69.02   1
## 3   1      3 30.00      0     0 30.00 0.97 28.63 69.02   1
## 4   2      1  3.37      1     0  3.37 0.60 36.42 53.81   0
## 5   2      2 30.00      0     0 30.00 0.60 36.42 53.81   0
## 6   2      3 30.00      0     0 30.00 0.60 36.42 53.81   0

The `simsurv` package

In the vignette on How to use the simsurv package, the package authors Sam Brilleman and Alessandro Gasparini state that they directly modeled their package on the Stata packagesurvsim and cite the Crowther and Lambert paper. They show how survsim builds out much of the functionality envisioned there in examples that demonstrate the interplay between model fitting and simulation. Example 2 of the vignette is concerned with constructing fake data modeled on the German breast cancer data by Schumacher et al. (1994).

data("brcancer")
head(brcancer)

##   id hormon rectime censrec
## 1  1      0    1814       1
## 2  2      1    2018       1
## 3  3      1     712       1
## 4  4      1    1807       1
## 5  5      0     772       1
## 6  6      0     448       1

The example begins by fitting alternative models to the data using functions from the flexsurv package of Jackson, Metcalfe and Amdahl. Two candidate models are proposed and a spline model giving the best fit is used to simulate data. The example concludes with more model fitting to examine the fake data. All of the examples in the vignette showcase the interplay between simsurv and flexsurv functions and emphasize the flexible modeling tools in flexsruv for building custom survival models.

The following code replicates the portion of Example 2 that illustrates the use of the flexsurvspline() function which allows the calculation of the log cumulative hazard function to depend on knot locations.

The code below produces the simulated data and uses the survminer package of Kassambara et al. to produce high quality Kaplan-Meier plots.

This line of code fits a three knot spline model to the brcancer data. The flexsurvspline() function, as with other functions in the flexsurv package build on the basic functionality of the fundamental Terry Therneau’s survival package.

true_mod <- flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon, 
                                     data = brcancer, k = 3)

This helper function returns the log cumulative hazard at time t

logcumhaz <- function(t, x, betas, knots) {
  
  # Obtain the basis terms for the spline-based log
  # cumulative hazard (evaluated at time t)
  basis <- flexsurv::basis(knots, log(t))
  
  # Evaluate the log cumulative hazard under the
  # Royston and Parmar specification
  res <- 
    betas[["gamma0"]] * basis[[1]] + 
    betas[["gamma1"]] * basis[[2]] +
    betas[["gamma2"]] * basis[[3]] +
    betas[["gamma3"]] * basis[[4]] +
    betas[["gamma4"]] * basis[[5]] +
    betas[["hormon"]] * x[["hormon"]]
  
  res
}

The simsurv() functions generates the simulated survival data.

covariates <- data.frame(id = 1:686, hormon = rbinom(686, 1, 0.5))
sim_data <- simsurv(
               betas = true_mod$coefficients, # "true" parameter values
               x = covariates,            # covariate data for 686 individuals
               knots = true_mod$knots,    # knot locations for splines
               logcumhazard = logcumhaz,  # definition of log cum hazard
               maxt = NULL,               # no right-censoring
               interval = c(1E-8,100000)) # interval for root finding
sim_data <- merge(covariates, sim_data)
head(sim_data)

##   id hormon eventtime status
## 1  1      1     240.4      1
## 2  2      0     942.7      1
## 3  3      1     463.5      1
## 4  4      0    1762.1      1
## 5  5      0    3976.4      1
## 6  6      0    2288.2      1

We use the surv_fit function from the survminer package to fit the Kaplan-Meier curves

KM_data <- survminer::surv_fit(Surv(rectime, censrec) ~ 1, data = brcancer)
KM_data_sim <- survminer::surv_fit(Surv(eventtime, status) ~ 1, data = sim_data)

Finally, plotting the curves shows that the simulsted data does appear to plausibly resemble the original data.

p <- ggsurvplot_combine(list(KM_data, KM_data_sim),
                risk.table = TRUE,
                conf.int = TRUE,
                censor = FALSE,
                conf.int.style = "step",
                tables.theme = theme_cleantable(),
                palette = "jco")

plot.new() 
print(p,newpage = FALSE)

I hope you find this small post helpful. The CRAN task view on Survival Analysis is a fantastic resource, but it can be a daunting task for non-experts to know where to begin to unravel the secrets there without a thread to pull on.

R/Medicine 2019 Workshops

Thu, 12 Sep 2019 00:00:00 +0000

R/Medicine 2019 kicked off on Thursday with two outstanding workshops. It was difficult to choose between the two, but fortunately both presenters developed rich sets of materials that are available online.

Alison Hill delivered R Markdown for Medicine with an elegant HTML exposition masterfully created to cultivate beginners while still engaging experienced R Markdown users.

Photo by Samuel Zeller on Unsplash

In four sections: (1) R Markdown Anatomy, (2) Outputs and Tables, (3) Graphics for Communication and (4) Data and Workflows she developed aspects of R Markdown aimed at statisticians and clinicians writing medical document which should also delight a wide audience of R Markdown users.

In the parallel session, Elizabeth (Beth) Atkinson distilled years of experience Wrangling survival data at the Mayo Clinic while presenting new functionality from version 3.0 of Terry Therneau’s Survival package which contains significant new material on multi-state models. (Version 3.0 is expected to make it to CRAN very soon, but if you can’t wait, you can install the new version from GitHub with: install_github("therneau/survival", dependencies=TRUE). Terry, who will be delivering the opening keynote presentation, also attended the workshop. It was a rare treat to hear Beth and Terry discuss best practices, pitfalls and common errors while fielding questions from the attendees. Beth assembled so much material it will take a “month of Sundays” to work through it all, but I doubt that there is a better source of material anywhere that makes the special difficulties of wrangling survival data more easily accessible. The following gem shows up early in the presentation.

R/Medicine is off to a great start.

Survival Analysis with R

Mon, 25 Sep 2017 00:00:00 +0000

With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters - de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation - developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.

Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall - left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers - people either new to R or new to survival analysis or both - must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger

The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.

ggfortify enables producing handsome, one-line survival plots with ggplot2::autoplot.

ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine-learning age.

Load the data

This first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a two-treatment, randomized trial for lung cancer.

library(survival)
library(ranger)
library(ggplot2)
library(dplyr)
library(ggfortify)

#------------
data(veteran)
head(veteran)

##   trt celltype time status karno diagtime age prior
## 1   1 squamous   72      1    60        7  69     0
## 2   1 squamous  411      1    70        5  64    10
## 3   1 squamous  228      1    60        3  38     0
## 4   1 squamous  126      1    60        9  63    10
## 5   1 squamous  118      1    70       11  65    10
## 6   1 squamous   10      1    20        5  49     0

The variables in veteran are: * trt: 1=standard 2=test * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large * time: survival time in days * status: censoring status * karno: Karnofsky performance score (100=good) * diagtime: months from diagnosis to randomization * age: in years * prior: prior therapy 0=no, 10=yes

Kaplan Meier Analysis

The first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.

# Kaplan Meier Survival Curve
km <- with(veteran, Surv(time, status))
head(km,80)

##  [1]  72  411  228  126  118   10   82  110  314  100+  42    8  144   25+
## [15]  11   30  384    4   54   13  123+  97+ 153   59  117   16  151   22 
## [29]  56   21   18  139   20   31   52  287   18   51  122   27   54    7 
## [43]  63  392   10    8   92   35  117  132   12  162    3   95  177  162 
## [57] 216  553  278   12  260  200  156  182+ 143  105  103  250  100  999 
## [71] 112   87+ 231+ 242  991  111    1  587  389   33

To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.

km_fit <- survfit(Surv(time, status) ~ 1, data=veteran)
summary(km_fit, times = c(1,30,60,90*(1:10)))

## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    137       2    0.985  0.0102      0.96552       1.0000
##    30     97      39    0.700  0.0392      0.62774       0.7816
##    60     73      22    0.538  0.0427      0.46070       0.6288
##    90     62      10    0.464  0.0428      0.38731       0.5560
##   180     27      30    0.222  0.0369      0.16066       0.3079
##   270     16       9    0.144  0.0319      0.09338       0.2223
##   360     10       6    0.090  0.0265      0.05061       0.1602
##   450      5       5    0.045  0.0194      0.01931       0.1049
##   540      4       1    0.036  0.0175      0.01389       0.0934
##   630      2       2    0.018  0.0126      0.00459       0.0707
##   720      2       0    0.018  0.0126      0.00459       0.0707
##   810      2       0    0.018  0.0126      0.00459       0.0707
##   900      2       0    0.018  0.0126      0.00459       0.0707

#plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready
autoplot(km_fit)

Next, we look at survival curves by treatment.

km_trt_fit <- survfit(Surv(time, status) ~ trt, data=veteran)
autoplot(km_trt_fit)

And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.

vet <- mutate(veteran, AG = ifelse((age < 60), "LT60", "OV60"),
              AG = factor(AG),
              trt = factor(trt,labels=c("standard","test")),
              prior = factor(prior,labels=c("N0","Yes")))

km_AG_fit <- survfit(Surv(time, status) ~ AG, data=vet)
autoplot(km_AG_fit)

Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.

Cox Proportional Hazards Model

Next, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.

# Fit Cox Model
cox <- coxph(Surv(time, status) ~ trt + celltype + karno                   + diagtime + age + prior , data = vet)
summary(cox)

## Call:
## coxph(formula = Surv(time, status) ~ trt + celltype + karno + 
##     diagtime + age + prior, data = vet)
## 
##   n= 137, number of events= 128 
## 
##                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
## trttest            2.946e-01  1.343e+00  2.075e-01  1.419  0.15577    
## celltypesmallcell  8.616e-01  2.367e+00  2.753e-01  3.130  0.00175 ** 
## celltypeadeno      1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05 ***
## celltypelarge      4.013e-01  1.494e+00  2.827e-01  1.420  0.15574    
## karno             -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09 ***
## diagtime           8.132e-05  1.000e+00  9.136e-03  0.009  0.99290    
## age               -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920    
## priorYes           7.159e-02  1.074e+00  2.323e-01  0.308  0.75794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## trttest              1.3426     0.7448    0.8939    2.0166
## celltypesmallcell    2.3669     0.4225    1.3799    4.0597
## celltypeadeno        3.3071     0.3024    1.8336    5.9647
## celltypelarge        1.4938     0.6695    0.8583    2.5996
## karno                0.9677     1.0334    0.9573    0.9782
## diagtime             1.0001     0.9999    0.9823    1.0182
## age                  0.9913     1.0087    0.9734    1.0096
## priorYes             1.0742     0.9309    0.6813    1.6937
## 
## Concordance= 0.736  (se = 0.03 )
## Rsquare= 0.364   (max possible= 0.999 )
## Likelihood ratio test= 62.1  on 8 df,   p=2e-10
## Wald test            = 62.37  on 8 df,   p=2e-10
## Score (logrank) test = 66.74  on 8 df,   p=2e-11

cox_fit <- survfit(cox)
#plot(cox_fit, main = "cph model", xlab="Days")
autoplot(cox_fit)

Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, time-dependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.

Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”

To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”

The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.

aa_fit <-aareg(Surv(time, status) ~ trt + celltype +
                 karno + diagtime + age + prior , 
                 data = vet)
aa_fit

## Call:
## aareg(formula = Surv(time, status) ~ trt + celltype + karno + 
##     diagtime + age + prior, data = vet)
## 
##   n= 137 
##     75 out of 97 unique event times used
## 
##                       slope      coef se(coef)      z        p
## Intercept          0.083400  3.81e-02 1.09e-02  3.490 4.79e-04
## trttest            0.006730  2.49e-03 2.58e-03  0.967 3.34e-01
## celltypesmallcell  0.015000  7.30e-03 3.38e-03  2.160 3.09e-02
## celltypeadeno      0.018400  1.03e-02 4.20e-03  2.450 1.42e-02
## celltypelarge     -0.001090 -6.21e-04 2.71e-03 -0.229 8.19e-01
## karno             -0.001180 -4.37e-04 8.77e-05 -4.980 6.28e-07
## diagtime          -0.000243 -4.92e-05 1.64e-04 -0.300 7.65e-01
## age               -0.000246 -6.27e-05 1.28e-04 -0.491 6.23e-01
## priorYes           0.003300  1.54e-03 2.86e-03  0.539 5.90e-01
## 
## Chisq=41.62 on 8 df, p=1.6e-06; test weights=aalen

#summary(aa_fit)  # provides a more complete summary of results
autoplot(aa_fit)

Random Forests Model

As a final example of what some might perceive as a data-science-like way to do time-to-event modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic tree-modeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.

ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.

# ranger model
r_fit <- ranger(Surv(time, status) ~ trt + celltype + 
                     karno + diagtime + age + prior,
                     data = vet,
                     mtry = 4,
                     importance = "permutation",
                     splitrule = "extratrees",
                     verbose = TRUE)

# Average the survival models
death_times <- r_fit$unique.death.times 
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob,mean)

# Plot the survival models for each patient
plot(r_fit$unique.death.times,r_fit$survival[1,], 
     type = "l", 
     ylim = c(0,1),
     col = "red",
     xlab = "Days",
     ylab = "survival",
     main = "Patient Survival Curves")

#
cols <- colors()
for (n in sample(c(2:dim(vet)[1]), 20)){
  lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n])
}
lines(death_times, avg_prob, lwd = 2)
legend(500, 0.7, legend = c('Average = black'))

The next block of code illustrates how ranger() ranks variable importance.

vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)

##          importance
## karno        0.0903
## celltype     0.0323
## diagtime    -0.0012
## trt         -0.0013
## prior       -0.0027
## age         -0.0037

Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest p-values in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.

But ranger() does compute Harrell’s c-index (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.

cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)

## Prediction Error = 1 - Harrell's c-index =  0.3087233

An ROC value of .68 would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Hamad observes:

However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.

I believe that the major use for tree-based models for survival data will be to deal with very large data sets.

Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().

# Set up for ggplot
kmi <- rep("KM",length(km_fit$time))
km_df <- data.frame(km_fit$time,km_fit$surv,kmi)
names(km_df) <- c("Time","Surv","Model")

coxi <- rep("Cox",length(cox_fit$time))
cox_df <- data.frame(cox_fit$time,cox_fit$surv,coxi)
names(cox_df) <- c("Time","Surv","Model")

rfi <- rep("RF",length(r_fit$unique.death.times))
rf_df <- data.frame(r_fit$unique.death.times,avg_prob,rfi)
names(rf_df) <- c("Time","Surv","Model")

plot_df <- rbind(km_df,cox_df,rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line()

For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.

This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how open-source code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s c-index packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.

Some Tutorials and Papers

For a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.

Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.

For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].

For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.

See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.

References

For convenience, I have collected the references used throughout the post here.

[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis Encyclopedia of Biostatistics 6. Wiley, pp. 4452-4461 [3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, pp. 457–481, 562–563. [4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, pp. 187–220.
[5] Diez, David. Survival Analysis in R, OpenIntro
[6] Klein, John P and Moeschberger, Melvin L. Survival Analysis Techniques for Censored and Truncated Data, Springer. (1997)
[7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15 (1996), pp. 361-387 [9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R
[10] NUS Course Notes. Chapter 3 The Cox Proportional Hazards Model
[11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen’s Additive Regression Model [12] Therneau et al. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model
[13] Kleinbaum, D.G. and Klein, M. Survival Analysis, A Self Learning Text Springer (2005) [14] Therneau, T and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines
[15] Intrator, O. and Kooperberg, C. Trees and splines in survival analysis Statistical Methods in Medical Research (1995)
[16] Bou-Hamad, I. A review of survival trees Statistics Surveys Vol.5 (2011)

Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.

Survival Analysis on R Views

Multistate Models for Medical Applications

The Data

Setting Up and Running the Model

Survival Curves and Calculations

Hazard Ratios

Exploring Transition Probabilities and Intensities

Learning More About Multi-State Survival Models

Beneath and Beyond the Cox Model

Appendix: Multiplicative Intensity Model

References

Biologically Plausible Fake Survival Data

Background

The survsim package

The simsurv package

R/Medicine 2019 Workshops

Survival Analysis with R

Load the data

Kaplan Meier Analysis

Cox Proportional Hazards Model

Random Forests Model

Some Tutorials and Papers

References

The `survsim` package

The `simsurv` package