External validation of the performance of competing risks prediction models: a guide through modern methods - Develop a risk prediction model using the subdistribution hazard approach
- Installing and loading packages and import data
- Descriptive statistics
- Develop a competing risks prediction model using the subdistribution hazard approach
- Reproducibility ticket
The following libraries are needed to achieve the outlined goals, the code chunk below will a) check whether you already have them installed, b) install them for you if not already present, and c) load the packages into the session.
We loaded the development data (rdata) and the validation data (vdata). More details about development and validation data are provided in the manuscript.
Characteristic | Development data, N = 1,000 | Validation data, N = 1,000 |
---|---|---|
Age (years) | ||
Mean (SD) | 75 (7) | 77 (6) |
Median (Range) | 74 (65, 95) | 76 (70, 96) |
Size (cm) | ||
Mean (SD) | 2.29 (1.31) | 2.13 (1.32) |
Median (Range) | 2.00 (0.10, 8.50) | 1.80 (0.09, 11.00) |
Nodal status | ||
negative | 642 (64%) | 688 (69%) |
positive | 358 (36%) | 312 (31%) |
Hormon receptor status | ||
ER and/or PR + | 822 (82%) | 857 (86%) |
ER-/PR- | 178 (18%) | 143 (14%) |
First, we draw the cumulative incidence curves of breast cancer recurrence.
Click to expand code
# Expand datasets -------------------------
rdata.w <- crprep(
Tstop = "time",
status = "status_num",
trans = c(1, 2),
id = "id",
keep = c("age", "size", "ncat", "hr_status"),
data = rdata
)
# Save extended data with weights for recurrence (failcode=1)
# and non recurrence mortality (failcode=2)
rdata.w1 <- rdata.w %>% filter(failcode == 1)
rdata.w2 <- rdata.w %>% filter(failcode == 2)
vdata.w <- crprep(
Tstop = "time",
status = "status_num",
trans = c(1, 2),
id = "id",
keep = c("age", "size", "ncat", "hr_status"),
data = vdata
)
vdata.w1 <- vdata.w %>% filter(failcode == 1)
vdata.w2 <- vdata.w %>% filter(failcode == 2)
# Development set --------
mfit_rdata <- survfit(
Surv(Tstart, Tstop, status == 1) ~ 1,
data = rdata.w1, weights = weight.cens
)
mfit_vdata <- survfit(
Surv(Tstart, Tstop, status == 1) ~ 1,
data = vdata.w1, weights = weight.cens
)
par(xaxs = "i", yaxs = "i", las = 1)
oldpar <- par(mfrow = c(1, 2), mar = c(5, 5, 1, 1))
plot(mfit_rdata,
col = 1, lwd = 2,
xlab = "Years since BC diagnosis",
ylab = "Cumulative incidence", bty = "n",
ylim = c(0, 0.25), xlim = c(0, 5), fun = "event", conf.int = TRUE
)
title("Development data")
plot(mfit_vdata,
col = 1, lwd = 2,
xlab = "Years since BC diagnosis",
ylab = "Cumulative incidence", bty = "n",
ylim = c(0, 0.25), xlim = c(0, 5), fun = "event", conf.int = TRUE
)
title("Validation data")
par(oldpar)
# Cumulative incidences
smfit_rdata <- summary(mfit_rdata, times = c(1, 2, 3, 4, 5))
smfit_vdata <- summary(mfit_vdata, times = c(1, 2, 3, 4, 5))
The R packages and functions cmprsk::cuminc()
and
mstate::Cuminc()
are good and easy alternatives to estimate the
cumulative incidence function.
Development data |
Validation data |
|||||
---|---|---|---|---|---|---|
Estimate | Lower .95 | Upper .95 | Estimate | Lower .95 | Upper .95 | |
1-year | 0.03 | 0.02 | 0.04 | 0.02 | 0.01 | 0.03 |
2-year | 0.07 | 0.05 | 0.08 | 0.05 | 0.03 | 0.06 |
3-year | 0.10 | 0.08 | 0.12 | 0.07 | 0.05 | 0.09 |
4-year | 0.13 | 0.10 | 0.15 | 0.09 | 0.08 | 0.11 |
5-year | 0.14 | 0.11 | 0.16 | 0.10 | 0.08 | 0.12 |
The 5-year cumulative incidence of breast cancer recurrence was 14% (95% CI: 11-16%), and 10% (95%CI: 8-12%)
Here we investigate the potential non-linear relation between continuous
predictors (i.e. age and size) and the outcomes. We apply three-knot
restricted cubic splines using rms::rcs()
function (details are given
in e.g. Frank Harrell’s book ‘Regression Model Strategies (second
edition)’, page 27.
Click to expand code
# Defining knots of the restricted cubic splines ------------------
# Extract knots position of the restricted cubic spline based on the
# original distribution of the data
# Age at breast cancer diagnosis
rcs3_age <- rcspline.eval(rdata$age, nk = 3)
attr(rcs3_age, "dim") <- NULL
attr(rcs3_age, "knots") <- NULL
pos_knots_age <- attributes(rcspline.eval(rdata$age, nk = 3))$knots
rdata$age3 <- rcs3_age
# Size of breast cancer
rcs3_size <- rcspline.eval(rdata$size, nk = 3)
attr(rcs3_size, "dim") <- NULL
attr(rcs3_size, "knots") <- NULL
pos_knots_size <- attributes(rcspline.eval(rdata$size, nk = 3))$knots
rdata$size3 <- rcs3_size
# FG model
dd <- datadist(rdata)
options(datadist = "dd")
fit_fg_rcs <- cph(Surv(Tstart, Tstop, status == 1) ~
rcs(age, pos_knots_age) + rcs(size, pos_knots_size) +
ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
P_fg_age_rcs <- Predict(fit_fg_rcs, "age")
P_fg_size_rcs <- Predict(fit_fg_rcs, "size")
# print(fit_fg_rcs)
# print(summary(fit_fg_rcs))
# print(anova(fit_fg_rcs))
oldpar <- par(mfrow = c(1, 2), mar = c(5, 5, 1, 1))
par(xaxs = "i", yaxs = "i", las = 1)
# FG - age
plot(P_fg_age_rcs$age,
P_fg_age_rcs$yhat,
type = "l",
lwd = 2,
col = "blue",
bty = "n",
xlab = "Age at breast cancer diagnosis",
ylab = "log Relative Hazard",
ylim = c(-2, 2),
xlim = c(65, 95)
)
polygon(c(
P_fg_age_rcs$age,
rev(P_fg_age_rcs$age)
),
c(
P_fg_age_rcs$lower,
rev(P_fg_age_rcs$upper)
),
col = "grey75",
border = FALSE
)
par(new = TRUE)
plot(P_fg_age_rcs$age,
P_fg_age_rcs$yhat,
type = "l",
lwd = 2,
col = "blue",
bty = "n",
xlab = "Age at breast cancer diagnosis",
ylab = "log Relative Hazard",
ylim = c(-2, 2),
xlim = c(65, 95)
)
# FG - size
par(xaxs = "i", yaxs = "i", las = 1)
plot(P_fg_size_rcs$size,
P_fg_size_rcs$yhat,
type = "l",
lwd = 2,
col = "blue",
bty = "n",
xlab = "Size of breast cancer",
ylab = "log Relative Hazard",
ylim = c(-2, 2),
xlim = c(0, 7)
)
polygon(c(
P_fg_size_rcs$size,
rev(P_fg_size_rcs$size)
),
c(
P_fg_size_rcs$lower,
rev(P_fg_size_rcs$upper)
),
col = "grey75",
border = FALSE
)
par(new = TRUE)
plot(P_fg_size_rcs$size,
P_fg_size_rcs$yhat,
type = "l",
lwd = 2,
col = "blue",
bty = "n",
xlab = "Size of breast cancer",
ylab = "log Relative Hazard",
ylim = c(-2, 2),
xlim = c(0, 7)
)
par(oldpar)
options(datadist = NULL)
# Fit the Fine and Gray model assuming linear relation of
# the continuous predictors
dd <- datadist(rdata, adjto.cat = "first")
options(datadist = "dd")
fit_fg <- cph(Surv(Tstart, Tstop, status == 1) ~ age + size +
ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
options(datadist = NULL)
AIC without splines | AIC with splines | |
---|---|---|
Fine and Gray models | 1819.695 | 1819.733 |
The AIC and the graphical check suggested a potential linear relation between the continuous predictors (age and size) and the event of interest (breast cancer recurrence).
We now examine the fits further by checking the proportionality of the subdistribution hazards of the models.
Click to expand code
zp_fg <- cox.zph(fit_fg, transform = "identity")
par(las = 1, xaxs = "i", yaxs = "i")
# c(bottom, left, top, right)
oldpar <- par(mfrow = c(2, 2), mar = c(5, 6.1, 3.1, 1))
sub_title <- c("Age", "Size", "Lymph node status", "HR status")
for (i in 1:4) {
plot(zp_fg[i],
resid = F,
bty = "n",
xlim = c(0, 5)
)
abline(0, 0, lty = 3)
title(sub_title[i])
}
mtext("Fine and Gray",
side = 3,
line = -1,
outer = TRUE,
font = 2
)
par(oldpar)
kable(round(zp_fg$table, 3)) %>%
kable_styling("striped", position = "center")
chisq | df | p | |
---|---|---|---|
age | 2.439 | 1 | 0.118 |
size | 4.874 | 1 | 0.027 |
ncat | 0.177 | 1 | 0.674 |
hr_status | 0.127 | 1 | 0.721 |
GLOBAL | 6.465 | 4 | 0.167 |
The statistical tests showed a potential violation of the proportional hazards assumption for size of breast cancer. For simplicity we ignore this violation in the remainder.
We develop and show the results of the Fine and Gray subdistribution hazard regression model. In R, there are multiple alternatives to fit a competing risks model using the subdistribution hazard approach:
- Using
survival::coxph()
after usingsurvival::finegray()
; - Using
rcs::cph()
orsurvival::coxph()
after expanding the data withmstate::crprep()
; - Using
riskRegression::FGR()
which is a formula interface for the functioncrr
from thecmprsk
package.
Click to expand code
rdata_fg <- finegray(Surv(time, status) ~ .,
etype = "rec",
data = rdata
)
fgfit <- coxph(Surv(fgstart, fgstop, fgstatus) ~
age + size + ncat + hr_status,
weight = fgwt,
data = rdata_fg
)
Click to expand code
# Expand data to prepare for fitting the model
primary_event <- 1 # primary event is 1 (breast cancer recurrence)
# set 2 to fit a model for non-recurrence mortality.
rdata.w <- mstate::crprep(
Tstop = "time",
status = "status_num",
trans = c(1, 2),
id = "id",
keep = c("age", "size", "ncat", "hr_status"),
data = rdata
)
rdata.w1 <- rdata.w %>% filter(failcode == primary_event)
dd <- datadist(rdata)
options(datadist = "dd")
options(prType = "html")
fit_fg <- cph(Surv(Tstart, Tstop, status == 1) ~
age + size + ncat + hr_status,
weights = weight.cens,
x = T,
y = T,
surv = T,
data = rdata.w1
)
print(fit_fg)
options(datadist = NULL)
Cox Proportional Hazards Model
cph(formula = Surv(Tstart, Tstop, status == 1) ~ age + size + ncat + hr_status, data = rdata.w1, weights = weight.cens, x = T, y = T, surv = T)
Model Tests | Discrimination Indexes |
|
---|---|---|
Obs 1486 | LR χ2 34.31 | R2 0.032 |
Events 135 | d.f. 4 | Dxy 0.286 |
Center 0.8068 | Pr(>χ2) 0.0000 | g 0.534 |
Score χ2 39.30 | gr 1.705 | |
Pr(>χ2) 0.0000 |
β | S.E. | Wald Z | Pr(>|Z|) | |
---|---|---|---|---|
age | 0.0002 | 0.0122 | 0.02 | 0.9859 |
size | 0.2044 | 0.0576 | 3.55 | 0.0004 |
ncat=positive | 0.4532 | 0.1784 | 2.54 | 0.0111 |
hr_status=ER-/PR- | 0.5860 | 0.1927 | 3.04 | 0.0024 |
Click to expand code
# We also fit the FG model using riskRegression::FGR()
# because it is useful to evaluate some of the prediction performance
# measures later
primary_event <- 1 # primary event is 1 (breast cancer recurrence)
# set 2 to fit a model for non-recurrence mortality.
fit_fgr <- FGR(Hist(time, status_num) ~
age + size +
ncat + hr_status,
cause = primary_event,
data = rdata
)
Summary of the model coefficients using survival
, mstate
and
survival
or rms
, and riskRegression
packages
survival package | mstate + survival/rms packages | riskRegression package | |
---|---|---|---|
age | 0.0002 | 0.0002 | 0.0002 |
size | 0.2044 | 0.2044 | 0.2044 |
ncatpositive | 0.4532 | 0.4532 | 0.4532 |
hr_statusER-/PR- | 0.5860 | 0.5860 | 0.5860 |
The coefficients of the models indicated that higher size, positive nodal status and ER- and/or PR- were associated with higher risk to develop a breast cancer recurrence.
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] webshot_0.5.2 gridExtra_2.3
## [3] rsample_0.1.1 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.7
## [7] purrr_0.3.4 readr_2.1.1
## [9] tidyr_1.1.4 tibble_3.1.6
## [11] tidyverse_1.3.1 boot_1.3-28
## [13] gtsummary_1.5.0 kableExtra_1.3.4
## [15] knitr_1.36 plotrix_3.8-2
## [17] riskRegression_2021.10.10 pec_2021.10.11
## [19] prodlim_2019.11.13 pseudo_1.4.3
## [21] geepack_1.3.3 KMsurv_0.1-5
## [23] mstate_0.3.2 rms_6.2-0
## [25] SparseM_1.81 Hmisc_4.6-0
## [27] ggplot2_3.3.5 Formula_1.2-4
## [29] lattice_0.20-45 survival_3.2-13
## [31] pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.3.0 systemfonts_1.0.3
## [4] plyr_1.8.6 listenv_0.8.0 TH.data_1.1-0
## [7] digest_0.6.29 foreach_1.5.1 htmltools_0.5.2
## [10] fansi_0.5.0 magrittr_2.0.1 checkmate_2.0.0
## [13] cluster_2.1.2 tzdb_0.2.0 recipes_0.1.17
## [16] globals_0.14.0 modelr_0.1.8 mets_1.2.9
## [19] gower_0.2.2 matrixStats_0.61.0 sandwich_3.0-1
## [22] svglite_2.0.0 jpeg_0.1-9 colorspace_2.0-2
## [25] rvest_1.0.2 haven_2.4.3 xfun_0.28
## [28] crayon_1.4.2 jsonlite_1.7.2 zoo_1.8-9
## [31] iterators_1.0.13 glue_1.5.1 gtable_0.3.0
## [34] ipred_0.9-12 MatrixModels_0.5-0 future.apply_1.8.1
## [37] scales_1.1.1 mvtnorm_1.1-3 DBI_1.1.1
## [40] Rcpp_1.0.7 viridisLite_0.4.0 cmprsk_2.2-10
## [43] htmlTable_2.3.0 foreign_0.8-81 stats4_4.1.2
## [46] lava_1.6.10 htmlwidgets_1.5.4 httr_1.4.2
## [49] RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3
## [52] nnet_7.3-16 dbplyr_2.1.1 here_1.0.1
## [55] utf8_1.2.2 caret_6.0-90 tidyselect_1.1.1
## [58] rlang_0.4.12 reshape2_1.4.4 munsell_0.5.0
## [61] cellranger_1.1.0 tools_4.1.2 cli_3.1.0
## [64] generics_0.1.1 broom_0.7.10 evaluate_0.14
## [67] fastmap_1.1.0 yaml_2.2.1 ModelMetrics_1.2.2.2
## [70] fs_1.5.1 timereg_2.0.1 future_1.23.0
## [73] nlme_3.1-153 quantreg_5.86 xml2_1.3.3
## [76] compiler_4.1.2 rstudioapi_0.13 png_0.1-7
## [79] gt_0.3.1 reprex_2.0.1 broom.helpers_1.5.0
## [82] stringi_1.7.6 highr_0.9 Matrix_1.3-4
## [85] vctrs_0.3.8 furrr_0.2.3 pillar_1.6.4
## [88] lifecycle_1.0.1 data.table_1.14.2 conquer_1.2.1
## [91] R6_2.5.1 latticeExtra_0.6-29 parallelly_1.29.0
## [94] codetools_0.2-18 polspline_1.1.19 MASS_7.3-54
## [97] assertthat_0.2.1 rprojroot_2.0.2 withr_2.4.3
## [100] multcomp_1.4-17 parallel_4.1.2 hms_1.1.1
## [103] grid_4.1.2 rpart_4.1-15 timeDate_3043.102
## [106] class_7.3-19 rmarkdown_2.11 pROC_1.18.0
## [109] numDeriv_2016.8-1.1 lubridate_1.8.0 base64enc_0.1-3