External validation of the performance of competing risks prediction models: a guide through modern methods - Develop a risk prediction model using the subdistribution hazard approach

Installing and loading packages and import data
Descriptive statistics
Develop a competing risks prediction model using the subdistribution hazard approach
Reproducibility ticket

Installing and loading packages and import data

The following libraries are needed to achieve the outlined goals, the code chunk below will a) check whether you already have them installed, b) install them for you if not already present, and c) load the packages into the session.

We loaded the development data (rdata) and the validation data (vdata). More details about development and validation data are provided in the manuscript.

Descriptive statistics

Characteristic	Development data, N = 1,000	Validation data, N = 1,000
Age (years)
Mean (SD)	75 (7)	77 (6)
Median (Range)	74 (65, 95)	76 (70, 96)
Size (cm)
Mean (SD)	2.29 (1.31)	2.13 (1.32)
Median (Range)	2.00 (0.10, 8.50)	1.80 (0.09, 11.00)
Nodal status
negative	642 (64%)	688 (69%)
positive	358 (36%)	312 (31%)
Hormon receptor status
ER and/or PR +	822 (82%)	857 (86%)
ER-/PR-	178 (18%)	143 (14%)

Develop a competing risks prediction model using the subdistribution hazard approach

Cumulative incidence curves

First, we draw the cumulative incidence curves of breast cancer recurrence.

Click to expand code

# Expand datasets -------------------------
rdata.w <- crprep(
  Tstop = "time",
  status = "status_num",
  trans = c(1, 2),
  id = "id",
  keep = c("age", "size", "ncat", "hr_status"),
  data = rdata
)
# Save extended data with weights for recurrence (failcode=1)
# and non recurrence mortality (failcode=2)
rdata.w1 <- rdata.w %>% filter(failcode == 1)
rdata.w2 <- rdata.w %>% filter(failcode == 2)

vdata.w <- crprep(
  Tstop = "time",
  status = "status_num",
  trans = c(1, 2),
  id = "id",
  keep = c("age", "size", "ncat", "hr_status"),
  data = vdata
)
vdata.w1 <- vdata.w %>% filter(failcode == 1)
vdata.w2 <- vdata.w %>% filter(failcode == 2)

# Development set --------
mfit_rdata <- survfit(
  Surv(Tstart, Tstop, status == 1) ~ 1,
  data = rdata.w1, weights = weight.cens
)
mfit_vdata <- survfit(
  Surv(Tstart, Tstop, status == 1) ~ 1,
  data = vdata.w1, weights = weight.cens
)
par(xaxs = "i", yaxs = "i", las = 1)
oldpar <- par(mfrow = c(1, 2), mar = c(5, 5, 1, 1))
plot(mfit_rdata,
  col = 1, lwd = 2,
  xlab = "Years since BC diagnosis",
  ylab = "Cumulative incidence", bty = "n",
  ylim = c(0, 0.25), xlim = c(0, 5), fun = "event", conf.int = TRUE
)
title("Development data")
plot(mfit_vdata,
  col = 1, lwd = 2,
  xlab = "Years since BC diagnosis",
  ylab = "Cumulative incidence", bty = "n",
  ylim = c(0, 0.25), xlim = c(0, 5), fun = "event", conf.int = TRUE
)
title("Validation data")
par(oldpar)
# Cumulative incidences
smfit_rdata <- summary(mfit_rdata, times = c(1, 2, 3, 4, 5))
smfit_vdata <- summary(mfit_vdata, times = c(1, 2, 3, 4, 5))

The R packages and functions cmprsk::cuminc() and mstate::Cuminc()are good and easy alternatives to estimate the cumulative incidence function.

	Development data			Validation data
	Estimate	Lower .95	Upper .95	Estimate	Lower .95	Upper .95
1-year	0.03	0.02	0.04	0.02	0.01	0.03
2-year	0.07	0.05	0.08	0.05	0.03	0.06
3-year	0.10	0.08	0.12	0.07	0.05	0.09
4-year	0.13	0.10	0.15	0.09	0.08	0.11
5-year	0.14	0.11	0.16	0.10	0.08	0.12

The 5-year cumulative incidence of breast cancer recurrence was 14% (95% CI: 11-16%), and 10% (95%CI: 8-12%)

Check non-linearity of continuous predictors

Here we investigate the potential non-linear relation between continuous predictors (i.e. age and size) and the outcomes. We apply three-knot restricted cubic splines using rms::rcs() function (details are given in e.g. Frank Harrell’s book ‘Regression Model Strategies (second edition)’, page 27.

Click to expand code

# Defining knots of the restricted cubic splines ------------------
# Extract knots position of the restricted cubic spline based on the
# original distribution of the data

# Age at breast cancer diagnosis
rcs3_age <- rcspline.eval(rdata$age, nk = 3)
attr(rcs3_age, "dim") <- NULL
attr(rcs3_age, "knots") <- NULL
pos_knots_age <- attributes(rcspline.eval(rdata$age, nk = 3))$knots
rdata$age3 <- rcs3_age

# Size of breast cancer
rcs3_size <- rcspline.eval(rdata$size, nk = 3)
attr(rcs3_size, "dim") <- NULL
attr(rcs3_size, "knots") <- NULL
pos_knots_size <- attributes(rcspline.eval(rdata$size, nk = 3))$knots
rdata$size3 <- rcs3_size

# FG model
dd <- datadist(rdata)
options(datadist = "dd")
fit_fg_rcs <- cph(Surv(Tstart, Tstop, status == 1) ~
rcs(age, pos_knots_age) + rcs(size, pos_knots_size) +
  ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
P_fg_age_rcs <- Predict(fit_fg_rcs, "age")
P_fg_size_rcs <- Predict(fit_fg_rcs, "size")
# print(fit_fg_rcs)
# print(summary(fit_fg_rcs))
# print(anova(fit_fg_rcs))

oldpar <- par(mfrow = c(1, 2), mar = c(5, 5, 1, 1))
par(xaxs = "i", yaxs = "i", las = 1)

# FG - age
plot(P_fg_age_rcs$age,
  P_fg_age_rcs$yhat,
  type = "l",
  lwd = 2,
  col = "blue",
  bty = "n",
  xlab = "Age at breast cancer diagnosis",
  ylab = "log Relative Hazard",
  ylim = c(-2, 2),
  xlim = c(65, 95)
)
polygon(c(
  P_fg_age_rcs$age,
  rev(P_fg_age_rcs$age)
),
c(
  P_fg_age_rcs$lower,
  rev(P_fg_age_rcs$upper)
),
col = "grey75",
border = FALSE
)
par(new = TRUE)
plot(P_fg_age_rcs$age,
  P_fg_age_rcs$yhat,
  type = "l",
  lwd = 2,
  col = "blue",
  bty = "n",
  xlab = "Age at breast cancer diagnosis",
  ylab = "log Relative Hazard",
  ylim = c(-2, 2),
  xlim = c(65, 95)
)

# FG - size
par(xaxs = "i", yaxs = "i", las = 1)
plot(P_fg_size_rcs$size,
  P_fg_size_rcs$yhat,
  type = "l",
  lwd = 2,
  col = "blue",
  bty = "n",
  xlab = "Size of breast cancer",
  ylab = "log Relative Hazard",
  ylim = c(-2, 2),
  xlim = c(0, 7)
)
polygon(c(
  P_fg_size_rcs$size,
  rev(P_fg_size_rcs$size)
),
c(
  P_fg_size_rcs$lower,
  rev(P_fg_size_rcs$upper)
),
col = "grey75",
border = FALSE
)
par(new = TRUE)
plot(P_fg_size_rcs$size,
  P_fg_size_rcs$yhat,
  type = "l",
  lwd = 2,
  col = "blue",
  bty = "n",
  xlab = "Size of breast cancer",
  ylab = "log Relative Hazard",
  ylim = c(-2, 2),
  xlim = c(0, 7)
)
par(oldpar)
options(datadist = NULL)

# Fit the Fine and Gray model assuming linear relation of
# the continuous predictors
dd <- datadist(rdata, adjto.cat = "first")
options(datadist = "dd")
fit_fg <- cph(Surv(Tstart, Tstop, status == 1) ~ age + size +
  ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
options(datadist = NULL)

	AIC without splines	AIC with splines
Fine and Gray models	1819.695	1819.733

The AIC and the graphical check suggested a potential linear relation between the continuous predictors (age and size) and the event of interest (breast cancer recurrence).

Checking proportional subdistribution hazards assumption

We now examine the fits further by checking the proportionality of the subdistribution hazards of the models.

Click to expand code

zp_fg <- cox.zph(fit_fg, transform = "identity")

par(las = 1, xaxs = "i", yaxs = "i")
# c(bottom, left, top, right)
oldpar <- par(mfrow = c(2, 2), mar = c(5, 6.1, 3.1, 1))
sub_title <- c("Age", "Size", "Lymph node status", "HR status")
for (i in 1:4) {
  plot(zp_fg[i],
    resid = F,
    bty = "n",
    xlim = c(0, 5)
  )
  abline(0, 0, lty = 3)
  title(sub_title[i])
}
mtext("Fine and Gray",
  side = 3,
  line = -1,
  outer = TRUE,
  font = 2
)
par(oldpar)

kable(round(zp_fg$table, 3)) %>%
  kable_styling("striped", position = "center")

	chisq	df	p
age	2.439	1	0.118
size	4.874	1	0.027
ncat	0.177	1	0.674
hr_status	0.127	1	0.721
GLOBAL	6.465	4	0.167

The statistical tests showed a potential violation of the proportional hazards assumption for size of breast cancer. For simplicity we ignore this violation in the remainder.

Model development - fit the risk prediction models

We develop and show the results of the Fine and Gray subdistribution hazard regression model. In R, there are multiple alternatives to fit a competing risks model using the subdistribution hazard approach:

Using survival::coxph() after using survival::finegray();
Using rcs::cph() or survival::coxph() after expanding the data with mstate::crprep();
Using riskRegression::FGR() which is a formula interface for the function crr from the cmprsk package.

Model development using survival package and finegray() function

Click to expand code

rdata_fg <- finegray(Surv(time, status) ~ .,
  etype = "rec",
  data = rdata
)

fgfit <- coxph(Surv(fgstart, fgstop, fgstatus) ~
age + size + ncat + hr_status,
weight = fgwt,
data = rdata_fg
)

Model development using mstate package and crprep() function

Click to expand code

# Expand data to prepare for fitting the model
primary_event <- 1 # primary event is 1 (breast cancer recurrence)
# set 2 to fit a model for non-recurrence mortality.
rdata.w <- mstate::crprep(
  Tstop = "time",
  status = "status_num",
  trans = c(1, 2),
  id = "id",
  keep = c("age", "size", "ncat", "hr_status"),
  data = rdata
)
rdata.w1 <- rdata.w %>% filter(failcode == primary_event)

dd <- datadist(rdata)
options(datadist = "dd")
options(prType = "html")

fit_fg <- cph(Surv(Tstart, Tstop, status == 1) ~
age + size + ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
print(fit_fg)
options(datadist = NULL)

Cox Proportional Hazards Model

 cph(formula = Surv(Tstart, Tstop, status == 1) ~ age + size + 
     ncat + hr_status, data = rdata.w1, weights = weight.cens, 
     x = T, y = T, surv = T)

	Model Tests	Discrimination Indexes
Obs 1486	LR χ² 34.31	R² 0.032
Events 135	d.f. 4	D_xy 0.286
Center 0.8068	Pr(>χ²) 0.0000	g 0.534
	Score χ² 39.30	g_r 1.705
	Pr(>χ²) 0.0000

	β	S.E.	Wald Z	Pr(>\|Z\|)
age	0.0002	0.0122	0.02	0.9859
size	0.2044	0.0576	3.55	0.0004
ncat=positive	0.4532	0.1784	2.54	0.0111
hr_status=ER-/PR-	0.5860	0.1927	3.04	0.0024

Model development using riskRegression package and FGR() function

Click to expand code

# We also fit the FG model using riskRegression::FGR()
# because it is useful to evaluate some of the prediction performance
# measures later

primary_event <- 1 # primary event is 1 (breast cancer recurrence)
# set 2 to fit a model for non-recurrence mortality.

fit_fgr <- FGR(Hist(time, status_num) ~
age + size +
  ncat + hr_status,
cause = primary_event,
data = rdata
)

Summary of the model coefficients

Summary of the model coefficients using survival, mstate and survival or rms, and riskRegression packages

	survival package	mstate + survival/rms packages	riskRegression package
age	0.0002	0.0002	0.0002
size	0.2044	0.2044	0.2044
ncatpositive	0.4532	0.4532	0.4532
hr_statusER-/PR-	0.5860	0.5860	0.5860

The coefficients of the models indicated that higher size, positive nodal status and ER- and/or PR- were associated with higher risk to develop a breast cancer recurrence.

Reproducibility ticket

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] webshot_0.5.2             gridExtra_2.3            
##  [3] rsample_0.1.1             forcats_0.5.1            
##  [5] stringr_1.4.0             dplyr_1.0.7              
##  [7] purrr_0.3.4               readr_2.1.1              
##  [9] tidyr_1.1.4               tibble_3.1.6             
## [11] tidyverse_1.3.1           boot_1.3-28              
## [13] gtsummary_1.5.0           kableExtra_1.3.4         
## [15] knitr_1.36                plotrix_3.8-2            
## [17] riskRegression_2021.10.10 pec_2021.10.11           
## [19] prodlim_2019.11.13        pseudo_1.4.3             
## [21] geepack_1.3.3             KMsurv_0.1-5             
## [23] mstate_0.3.2              rms_6.2-0                
## [25] SparseM_1.81              Hmisc_4.6-0              
## [27] ggplot2_3.3.5             Formula_1.2-4            
## [29] lattice_0.20-45           survival_3.2-13          
## [31] pacman_0.5.1             
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         backports_1.3.0      systemfonts_1.0.3   
##   [4] plyr_1.8.6           listenv_0.8.0        TH.data_1.1-0       
##   [7] digest_0.6.29        foreach_1.5.1        htmltools_0.5.2     
##  [10] fansi_0.5.0          magrittr_2.0.1       checkmate_2.0.0     
##  [13] cluster_2.1.2        tzdb_0.2.0           recipes_0.1.17      
##  [16] globals_0.14.0       modelr_0.1.8         mets_1.2.9          
##  [19] gower_0.2.2          matrixStats_0.61.0   sandwich_3.0-1      
##  [22] svglite_2.0.0        jpeg_0.1-9           colorspace_2.0-2    
##  [25] rvest_1.0.2          haven_2.4.3          xfun_0.28           
##  [28] crayon_1.4.2         jsonlite_1.7.2       zoo_1.8-9           
##  [31] iterators_1.0.13     glue_1.5.1           gtable_0.3.0        
##  [34] ipred_0.9-12         MatrixModels_0.5-0   future.apply_1.8.1  
##  [37] scales_1.1.1         mvtnorm_1.1-3        DBI_1.1.1           
##  [40] Rcpp_1.0.7           viridisLite_0.4.0    cmprsk_2.2-10       
##  [43] htmlTable_2.3.0      foreign_0.8-81       stats4_4.1.2        
##  [46] lava_1.6.10          htmlwidgets_1.5.4    httr_1.4.2          
##  [49] RColorBrewer_1.1-2   ellipsis_0.3.2       pkgconfig_2.0.3     
##  [52] nnet_7.3-16          dbplyr_2.1.1         here_1.0.1          
##  [55] utf8_1.2.2           caret_6.0-90         tidyselect_1.1.1    
##  [58] rlang_0.4.12         reshape2_1.4.4       munsell_0.5.0       
##  [61] cellranger_1.1.0     tools_4.1.2          cli_3.1.0           
##  [64] generics_0.1.1       broom_0.7.10         evaluate_0.14       
##  [67] fastmap_1.1.0        yaml_2.2.1           ModelMetrics_1.2.2.2
##  [70] fs_1.5.1             timereg_2.0.1        future_1.23.0       
##  [73] nlme_3.1-153         quantreg_5.86        xml2_1.3.3          
##  [76] compiler_4.1.2       rstudioapi_0.13      png_0.1-7           
##  [79] gt_0.3.1             reprex_2.0.1         broom.helpers_1.5.0 
##  [82] stringi_1.7.6        highr_0.9            Matrix_1.3-4        
##  [85] vctrs_0.3.8          furrr_0.2.3          pillar_1.6.4        
##  [88] lifecycle_1.0.1      data.table_1.14.2    conquer_1.2.1       
##  [91] R6_2.5.1             latticeExtra_0.6-29  parallelly_1.29.0   
##  [94] codetools_0.2-18     polspline_1.1.19     MASS_7.3-54         
##  [97] assertthat_0.2.1     rprojroot_2.0.2      withr_2.4.3         
## [100] multcomp_1.4-17      parallel_4.1.2       hms_1.1.1           
## [103] grid_4.1.2           rpart_4.1-15         timeDate_3043.102   
## [106] class_7.3-19         rmarkdown_2.11       pROC_1.18.0         
## [109] numDeriv_2016.8-1.1  lubridate_1.8.0      base64enc_0.1-3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!