Skip to content

Commit 6812619

Browse files
author
Ondrej Slama
authored
Add Best PC and update vignette (#4)
2 parents b90f8b9 + 8d50526 commit 6812619

File tree

6 files changed

+93
-59
lines changed

6 files changed

+93
-59
lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
Package: stats4phc
22
Title: Performance Evaluation for the Prognostic Value of Predictive Models Intended to Support
33
Personalized Healthcare Through Predictiveness Curves and Positive / Negative Predictive Values
4-
Version: 0.1
4+
Version: 0.1.1
55
Authors@R: c(
66
person("Ondrej", "Slama", email = "[email protected]", role = c("aut", "cre")),
77
person("Darrick", "Shen", email = "[email protected]", role = "aut"),

R/PV.R

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -116,12 +116,13 @@ predictionColours <- function(x, show.best) {
116116
"NPV" = "#53B400",
117117
"Best PPV" = "gray65",
118118
"PPV" = "red",
119+
"Best PC" = "black",
119120
"PC" = "plum",
120121
"1-NPV" = "royalblue2",
121122
"Best 1-NPV" = "gray50"
122123
)
123124
if (show.best) {
124-
x <- c(x, paste("Best", setdiff(x, "PC")))
125+
x <- c(x, paste("Best", x))
125126
}
126127
return(clrs[names(clrs) %in% x]) # keep the ordering above
127128
}

R/riskProfile.R

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -386,12 +386,13 @@ riskProfile <- function(outcome,
386386
distinct() %>%
387387
arrange(.data$percentile, .data$score) %>%
388388
mutate(
389+
PC = ifelse(.data$percentile <= 1 - prev, 0, 1),
389390
PPV = bestPPV(perc = .data$percentile, prev = prev),
390391
`1-NPV` = bestMNPV(perc = .data$percentile, prev = prev),
391392
NPV = 1 - .data$`1-NPV`
392393
) %>%
393394
tidyr::pivot_longer(
394-
cols = c("PPV", "1-NPV", "NPV"), names_to = "pv", values_to = "pvValue"
395+
cols = c("PC", "PPV", "1-NPV", "NPV"), names_to = "pv", values_to = "pvValue"
395396
) %>%
396397
filter(.data$pv %in% include) %>%
397398
mutate(
@@ -404,10 +405,10 @@ riskProfile <- function(outcome,
404405
geom_line(
405406
data = best,
406407
aes(x = .data[[xvar]], y = .data$pvValue, linewidth = .data$pv),
407-
colour = "gray70"
408+
colour = "gray60"
408409
) +
409410
scale_linewidth_manual(
410-
values = c("Best 1-NPV" = 0.3, "Best PPV" = 0.3, "Best NPV" = 0.3),
411+
values = c("Best 1-NPV" = 0.3, "Best PPV" = 0.3, "Best PC" = 0.3, "Best NPV" = 0.3),
411412
name = "Best PVs"
412413
)
413414
} else {

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ remotes::install_github(repo = "genentech/stats4phc")
3030
For reproducibility, refer to a specific version tag, for example
3131

3232
``` r
33-
remotes::install_github(repo = "genentech/stats4phc", ref = "v0.1")
33+
remotes::install_github(repo = "genentech/stats4phc", ref = "v0.1.1")
3434
```
3535

3636

man/stats4phc-package.Rd

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

vignettes/stats4phc.Rmd

Lines changed: 84 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -21,77 +21,105 @@ This package provides functions for performance evaluation for the prognostic va
2121
- Calibration
2222
- Sensitivity and specificity
2323

24-
To begin with, let's align on terminology.
24+
The intention is not to replace standard metrics evaluation (like Brier score, log-loss, or AUC).
25+
On the contrary, the mentioned quantities should be checked in addition in order to get the overall sense of model behaviour.
26+
2527

2628

2729
## Terminology
2830

31+
To begin with, let's align on terminology.
2932
Below we define the terms that will be used across the article:
3033

3134
- outcome: the true observation of the quantity of interest;
3235

33-
- score:
36+
- score / risk:
3437

3538
- either a raw value (e.g. a biomarker) for the purpose of measuring (or approximating) the outcome,
3639

37-
- or a prediction score given by a predictive model, where the outcome was modeled as the response;
40+
- or a probability given by a predictive model, where the outcome was modeled as the response;
3841

3942
- estimate: output of a statistical methodology, where score is used as independent variable and outcome as a dependent variable.
4043

4144

45+
Let's now load the package and the example data.
46+
47+
```{r}
48+
library(stats4phc)
49+
50+
# Read in example data
51+
auroc <- read.csv(system.file("extdata", "sample.csv", package = "stats4phc"))
52+
rscore <- auroc$predicted_calibrated # vector of already calibrated model probabilities
53+
truth <- as.numeric(auroc$actual) # vector of 0s or 1s
54+
```
55+
56+
4257
# Predictiveness curves
4358

4459
Predictiveness curves are an insightful visualization to assess the inherent ability of prognostic models to provide predictions to individual patients. Cumulative versions of predictiveness curves represent positive predictive values (PPV) and 1 - negative predictive values (1 - NPV) and are also informative if the eventual goal is to use a cutoff for clinical decision making.
4560

4661
You can use `riskProfile` function to visualize and assess all these quantities.
4762

48-
Here is an example:
63+
Note that method "asis" (below on the graphs) means that the score (or model probabilities in our case) are taken as is,
64+
i.e. there is no estimation or smoothing.
4965

50-
```{r}
51-
library(stats4phc)
5266

53-
# Read in example data
54-
auroc <- read.csv(system.file("extdata", "sample.csv", package = "stats4phc"))
55-
rscore <- auroc$predicted_calibrated
56-
truth <- as.numeric(auroc$actual)
57-
```
67+
## Predictiveness curve
68+
69+
Let's start with predictiveness curve:
5870

5971
```{r fig.height=5, fig.width=8}
60-
p <- riskProfile(outcome = truth, score = rscore)
72+
p <- riskProfile(outcome = truth, score = rscore, include = "PC")
6173
p$plot
6274
```
6375

64-
```{r}
65-
head(p$data)
76+
Ideally, all subjects in the population that have the condition (=> prevalence) are marked as having the condition (predicted risk = 1) and all subjects without the condition (=> 1 - prevalence) are marked as not having the condition (predicted risk = 0).
77+
This implies that the ideal predictiveness curve is 0 for all subjects not having the condition,
78+
and then it steps (jumps) at `1 - prevalence` to 1 for all the subjects having the condition (see the gray line).
79+
80+
In reality, the curves are not step functions. The more flat the curves get, the less discrimination, and therefore utility, there is in the model.
81+
82+
One can also investigate the tails of predictiveness curve. The model is more useful if these regions have very low or very high predicted risks relatively to the rest of the data.
83+
84+
85+
## Positive / Negative predictive values
86+
87+
Now let's plot PPV and 1-NPV:
88+
89+
```{r fig.height=5, fig.width=8}
90+
p <- riskProfile(outcome = truth, score = rscore, include = c("PPV", "1-NPV"))
91+
p$plot
6692
```
6793

68-
There is an extensive documentation of this function with examples if you run `?riskProfile`.
94+
Again, in an ideal case, they both are as close to the gray lines as possible.
6995

70-
To briefly highlight the functionalities:
96+
In an ideal scenario:
7197

72-
- Use the `methods` argument to specify the desired estimation method (see the last section) or use `"asis"` for no estimation.
98+
- in terms of PPV: If all the subjects with the condition are predicted perfectly, then PPV = TP / PP = 1 (TP = true positive, PP = predicted positive).
99+
Hence, all the subjects with the condition must be higher than `1 - prevalence` on the prediction percentile for PPV = 1.
73100

74-
- You can adjust the prevalence by setting `prev.adj` to the desired amount.
101+
- in terms of 1-NPV: If all the subjects without the condition are predicted perfectly, then NPV = TN / PN = 1 (TN = true negative, PN = predicted negative).
102+
Hence, all the subjects without the condition must be lower than `1 - prevalence` on the risk percentile for 1-NPV = 0.
75103

76-
- `show.nonparam.pv` controls whether to show/hide the non-parametric estimation of PPV, 1-NPV, and NPV.
77104

78-
- `show.best.pv` controls whether to show/hide the theoretically best possible PPV, 1-NPV, NPV.
105+
## Output settings
79106

80-
- Use `include` argument to specify what quantities to show:
107+
Note that:
81108

82-
- PC = predictiveness curve
109+
- most importantly, in case of a biomarker or if the model probabilities are not calibrated well, you can use a smoother, see `methods` argument and the last section of the vignette.
83110

84-
- PPV = positive predictive value
111+
- the prevalence can be adjusted by setting it in `prev.adj`.
85112

86-
- NPV = negative predictive value
113+
- you can also plot "NPV" by adjusting the `include` parameter.
87114

88-
- 1-NPV = 1 - negative predictive value
115+
- you can also access the underlying data:
89116

90-
- `plot.raw` sets whether to plot raw values or percentiles.
117+
```{r}
118+
p <- riskProfile(outcome = truth, score = rscore, include = c("PPV", "1-NPV"))
119+
head(p$data)
120+
```
91121

92-
- `rev.order` sets whether to reverse the order of the score (useful if higher score refers to lower outcome).
93122

94-
- The output is the plot itself and the underlying data.
95123

96124
# Calibration
97125

@@ -104,37 +132,41 @@ p <- calibrationProfile(outcome = truth, score = rscore)
104132
p$plot
105133
```
106134

107-
```{r}
108-
head(p$data)
109-
```
135+
In an ideal scenario, the fitted curves should be identical with the identity line.
136+
137+
In reality, the closer they are to the identity line, the better.
138+
139+
Note that you can also quantify calibration through discrimination and miscalibration index,
140+
see this [blog post](https://stats4datascience.com/posts/three_metrics_binary/)
141+
and [modsculpt](https://github.com/Genentech/modsculpt)
142+
R package (metrics functions).
143+
110144

111-
There is an extensive documentation of this function with examples if you run `?calibrationProfile`.
145+
## Output settings
112146

113-
To briefly highlight the functionalities:
147+
Note that:
114148

115-
- Use the `methods` argument to specify the desired estimation method (see the last section). In this case, `"asis"` is not allowed.
149+
- in case of a biomarker or if the model probabilities are not calibrated well, you can use a smoother, see `methods` argument and the last section of the vignette. In this case, `"asis"` is not allowed.
116150

117-
- Use `include` argument to specify what additional quantities to show:
151+
- use `include` argument to specify what additional quantities to show:
118152

119-
- `"loess"`: Adds non-parametric Loess fit.
153+
- `"loess"`: Adds non-parametric Loess fit.
120154

121-
- `"citl"`: Adds "Calibration in the Large", an overall mean of outcome and score.
155+
- `"citl"`: Adds "Calibration in the Large", an overall mean of outcome and score.
122156

123-
- `"rug"`: Adds "rug", i.e. ticks on x-axis showing the individual data points (top axis shows score for outcome == 1, bottom axis shows score for outcome == 0).
157+
- `"rug"`: Adds "rug", i.e. ticks on x-axis showing the individual data points (top axis shows score for outcome == 1, bottom axis shows score for outcome == 0).
124158

125-
- `"datapoints"`: Similar to rug, just shows jittered points instead of ticks.
159+
- `"datapoints"`: Similar to rug, just shows jittered points instead of ticks.
126160

127-
- `plot.raw` sets whether to plot raw values or percentiles.
161+
- use `margin.type` to add a marginal plot through `ggExtra::ggMarginal`. You can select one of `c("density", "histogram", "boxplot", "violin", "densigram")`. It adds the selected 1d graph on top of the calibrtion plot and is suitable for investigating the score.
128162

129-
- `rev.order` sets whether to reverse the order of the score (useful if higher score refers to lower outcome).
163+
- you can again access the underlying data with `p$data`.
130164

131-
- Use `margin.type` to add a marginal plot through `ggExtra::ggMarginal`. You can select one of `c("density", "histogram", "boxplot", "violin", "densigram")`. It adds the selected 1d graph on top of the calibrtion plot and is suitable for investigating the score.
132165

133-
- The output is the plot itself and the underlying data.
134166

135167
# Sensitivity and specificity
136168

137-
Ultimately, we provide a sensitivity and specificity plot. For these quantities you need to define a cutoff with which you can trasnform the numeric score to binary. We use data-driven cutoffs, meaning that every single value of score is taken as the cutoff, allowing us to visualize the sensitivity and specificity as a function of score. This graph may inform you of the best suitable cutoff for your model, although we usually recommend to output the whole score range, not just the binary decisions.
169+
Ultimately, we provide a sensitivity and specificity plot as a function of score (threshold is data-driven). This graph may inform you of the best suitable cutoff for your model, although we usually recommend to output the whole score range, not just the binary decisions.
138170

139171
You can use `sensSpec` function to visualize and assess sensitivity and specificity.
140172

@@ -143,21 +175,19 @@ p <- sensSpec(outcome = truth, score = rscore)
143175
p$plot
144176
```
145177

146-
```{r}
147-
head(p$data)
148-
```
178+
Again, the ideal scenario would be having a model following the gray lines.
179+
Since there is a trade-off between sensitivity and specificity, the graph may guide you which threshold (or thresholds) to choose, depending if one is more important than the other.
149180

150-
There is an extensive documentation of this function with examples if you run `?sensSpec`.
151181

152-
To briefly highlight the functionalities:
182+
## Output settings
153183

154-
- Use the `methods` argument to specify the desired estimation method (see the last section) or use `"asis"` for no estimation.
184+
Note that:
155185

156-
- `show.best` controls whether to show/hide the theoretically best possible sensitivity and specificity.
186+
- in case of a biomarker or if the model probabilities are not calibrated well, you can use a smoother, see `methods` argument and the last section of the vignette.
187+
188+
- you can again access the underlying data with `p$data`.
157189

158-
- `plot.raw` sets whether to plot raw values or percentiles.
159190

160-
- `rev.order` sets whether to reverse the order of the score (useful if higher score refers to lower outcome).
161191

162192
# Adjusting the graphs
163193

@@ -188,6 +218,7 @@ p$plot +
188218

189219
Otherwise, you can use the `$data` element to construct your own graph as well.
190220

221+
191222
# Estimations in stats4phc
192223

193224
For all the plotting functions from this package, there is a possibility to define an estimation function, which will be applied on the given score. In `calibrationProfile`, this serves as a calibration curve. In `riskProfile`, this smooths the given score. All of this is always driven by the `methods` argument, which is available in each of the plotting functions.

0 commit comments

Comments
 (0)