Ds/season summary #197


Merged
merged 64 commits on May 10, 2025

Commits
55ddeee
initial decreasing forecasters rmd
dsweber2 Apr 10, 2025
3e57aec
geo_pooled
dsweber2 Apr 10, 2025
f3bb95d
the problem is quantile regression
dsweber2 Apr 11, 2025
9e126b9
double population scaling, fixing made it worse
dsweber2 Apr 11, 2025
8afc79f
linear parenthetical, boost trees are fine
dsweber2 Apr 11, 2025
2e52758
moving things around, coefficient inspection
dsweber2 Apr 11, 2025
9cc9f39
format and add as_of plots to fanplots
dshemetov Apr 11, 2025
78387df
exploring the fit data in detail
dsweber2 Apr 11, 2025
05e4dc5
also doing this for covid
dsweber2 Apr 11, 2025
19960bc
reorg
dshemetov Apr 12, 2025
4c85eff
tweaks
dshemetov Apr 12, 2025
e6f2b07
yet more tweaks
dsweber2 Apr 14, 2025
adc5bc3
clearer read-through for others
dsweber2 Apr 14, 2025
2057e79
tests borked b/c curl?
dsweber2 Apr 14, 2025
6d767b7
filter pre 2022, test dependencies
dsweber2 Apr 14, 2025
a2dbebf
enh: add forecasting on diffs (ARI rather than AR)
dshemetov Apr 14, 2025
8b04c04
doc: add some comments
dshemetov Apr 14, 2025
54850b7
growth_rate filtering
dsweber2 Apr 14, 2025
24d03b1
enh: try seasonal windowing on diffs forecaster
dshemetov Apr 14, 2025
7307734
enh: do diffs forecast on flusion data
dshemetov Apr 15, 2025
7688a20
minor fixes
dsweber2 Apr 15, 2025
5c40e73
fix+enh: minor fixes and add 0 intercept to slope calculation
dshemetov Apr 16, 2025
572939e
wip: start season summary
dshemetov Apr 16, 2025
e542fc8
Getting backtest_mode working
dsweber2 Apr 16, 2025
0f4a119
revision summary notebook
dsweber2 Apr 16, 2025
ddab61c
Basic revision summary complete
dsweber2 Apr 18, 2025
3c6de3b
scores mix all forecasters, first covid day problems
dsweber2 Apr 22, 2025
b06f23a
phase definitions and scores
dsweber2 Apr 22, 2025
4967de7
fix: covid generation dates
dshemetov Apr 23, 2025
ddd1fdb
external scores updating, score only shared dates
dsweber2 Apr 25, 2025
80ab82e
hotfix: april 9 data tweaks
dshemetov Apr 15, 2025
f79c22b
doc: make run.R more correct about env vars
dshemetov Apr 15, 2025
69fd4cc
toc, various minor notes
dsweber2 Apr 28, 2025
0c11070
Merge branch 'main' into ds/season-summary
dshemetov Apr 28, 2025
cd70e80
including forecasts, more text
dsweber2 Apr 28, 2025
110c2f5
include first_day_wrong, covid forecasts
dsweber2 Apr 28, 2025
49f8921
order via factor, covid fcsts, ggplotly
dsweber2 Apr 29, 2025
c2ef046
doc: season summary lint and covid updates
dshemetov Apr 29, 2025
48c3b69
doc: first day wrong lints
dshemetov Apr 29, 2025
def1448
doc: big update to template.md, describe our forecaster families
dshemetov Apr 29, 2025
5b8da7c
doc: add some styling to template.md
dshemetov Apr 29, 2025
2455449
doc: minor template lint
dshemetov Apr 29, 2025
bbd73dd
doc: more styling
dshemetov Apr 29, 2025
6eef6a1
doc: even more
dshemetov Apr 29, 2025
71beaf0
doc: lint revision summary
dshemetov Apr 29, 2025
02668e1
latest forecast
dsweber2 Apr 29, 2025
f6822d0
latest fcst needs -1 ahead not present
dsweber2 Apr 30, 2025
ffd1c9a
adding latest to flu
dsweber2 Apr 30, 2025
93d9188
`latest` results, takeaways
dsweber2 Apr 30, 2025
89ffe08
doc+lint: add future work section, caveat diffed tests, improve forec…
dshemetov May 8, 2025
dc917a2
wip: exploration summary
dshemetov May 8, 2025
7d7ab7b
Merge branch 'main' into ds/season-summary
dshemetov May 8, 2025
3799554
doc: template
dshemetov May 8, 2025
0d9b6bd
fix: remove debug from overall notebook
dshemetov May 8, 2025
dae68be
doc: exploration summary
dshemetov May 8, 2025
c961f8d
repo: renv
dshemetov May 8, 2025
fa6bbab
lint: remove priority target args, as they're deprecated
dshemetov May 8, 2025
e065f8d
repo: renv
dshemetov May 8, 2025
3386797
fix: a very complicated way to fix a bug
dshemetov May 9, 2025
2c0768b
lint: minor idiom tweak
dshemetov May 9, 2025
bf55134
feat: simplify daily_to_weekly_archive
dshemetov May 10, 2025
8ba796e
enh: add summary reports to makefile
dshemetov May 10, 2025
e9c5f3b
fix: css
dshemetov May 10, 2025
a8b70b0
f
dshemetov May 10, 2025
15 changes: 9 additions & 6 deletions Makefile
@@ -11,6 +11,12 @@ test:
run:
Rscript scripts/run.R

run-nohup:
nohup Rscript scripts/run.R &

run-nohup-restarting:
scripts/hardRestarting.sh &

prod-covid:
export TAR_RUN_PROJECT=covid_hosp_prod; Rscript scripts/run.R

@@ -65,12 +71,6 @@ get-nwss:
python nwss_covid_export.py; \
python nwss_influenza_export.py

run-nohup:
nohup Rscript scripts/run.R &

run-nohup-restarting:
scripts/hardRestarting.sh &

sync:
Rscript -e "source('R/sync_aws.R'); sync_aws()"

@@ -98,3 +98,6 @@ get-flu-prod-errors:

get-covid-prod-errors:
Rscript -e "suppressPackageStartupMessages(source(here::here('R', 'load_all.R'))); get_targets_errors(project = 'covid_hosp_prod')"

summary_reports:
Rscript scripts/summary_reports.R
71 changes: 15 additions & 56 deletions R/aux_data_utils.R
@@ -213,13 +213,15 @@ daily_to_weekly <- function(epi_df, agg_method = c("sum", "mean"), keys = "geo_v
#' @param epi_arch the archive to aggregate.
#' @param agg_columns the columns to aggregate.
#' @param agg_method the method to use to aggregate the data, one of "sum" or "mean".
#' @param day_of_week the day of the week to use as the reference day.
#' @param day_of_week_end the day of the week to use as the end of the week.
#' @param week_reference the day of the week to use as the reference day (Wednesday is default).
#'   This is 1-indexed in the base R `wday()` convention, so 1 = Sunday, 2 = Monday, ..., 7 = Saturday.
#' @param week_start the day of the week to use as the start of the week (Sunday is default).
#'   This is passed straight to lubridate, whose convention is 1 = Monday, 2 = Tuesday, ..., 7 = Sunday.
daily_to_weekly_archive <- function(epi_arch,
agg_columns,
agg_method = c("sum", "mean"),
day_of_week = 4L,
day_of_week_end = 7L) {
week_reference = 4L,
week_start = 7L) {
# How to aggregate the windowed data.
agg_method <- arg_match(agg_method)
# The columns we will later group by when aggregating.
@@ -230,67 +232,24 @@ daily_to_weekly_archive <- function(epi_arch,
sort()
# Choose a fast function to use to slide and aggregate.
if (agg_method == "sum") {
slide_fun <- epi_slide_sum
# If the week is complete, this is equivalent to the sum. If the week is not
# complete, this is equivalent to 7/(number of days in the week) * the sum,
# which should be a decent approximation.
agg_fun <- \(x) 7 * mean(x, na.rm = TRUE)
} else if (agg_method == "mean") {
slide_fun <- epi_slide_mean
agg_fun <- \(x) mean(x, na.rm = TRUE)
}
# Slide over the versions and aggregate.
epix_slide(
epi_arch,
.versions = ref_time_values,
function(x, group_keys, ref_time) {
# The last day of the week we will slide over.
ref_time_last_week_end <- floor_date(ref_time, "week", day_of_week_end - 1)

# To find the days we will slide over, we need to find the first and last
# complete weeks of data. Get the max and min times, and then find the
# first and last complete weeks of data.
min_time <- min(x$time_value)
max_time <- max(x$time_value)

# Let's determine if the min and max times are in the same week.
ceil_min_time <- ceiling_date(min_time, "week", week_start = day_of_week_end - 1)
ceil_max_time <- ceiling_date(max_time, "week", week_start = day_of_week_end - 1)

# If they're not in the same week, this means we have at least one
# complete week of data to slide over.
if (ceil_min_time < ceil_max_time) {
valid_slide_days <- seq.Date(
from = ceiling_date(min_time, "week", week_start = day_of_week_end - 1),
to = floor_date(max_time, "week", week_start = day_of_week_end - 1),
by = 7L
)
} else {
# This is the degenerate case, where we have about 1 week or less of
# data. In this case, we opt to return nothing for two reasons:
# 1. in most cases here, the data is incomplete for a single week,
# 2. if the data is complete, a single week of data is not enough to
# reasonably perform any kind of aggregation.
return(tibble())
}

# If the last day of the week is not the end of the week, add it to the
# list of valid slide days (this will produce an incomplete slide, but
# that's fine for us, since it should only be 1 day, historically.)
if (wday(max_time) != day_of_week_end) {
valid_slide_days <- c(valid_slide_days, max_time)
}

# Slide over the days and aggregate.
x %>%
group_by(across(all_of(keys))) %>%
slide_fun(
agg_columns,
.window_size = 7L,
na.rm = TRUE,
.ref_time_values = valid_slide_days
) %>%
select(-all_of(agg_columns)) %>%
rename_with(~ gsub("slide_value_", "", .x)) %>%
rename_with(~ gsub("_7dsum", "", .x)) %>%
# Round all dates to reference day of the week. These will get
# de-duplicated by compactify in as_epi_archive below.
mutate(time_value = round_date(time_value, "week", day_of_week - 1)) %>%
mutate(week_start = ceiling_date(time_value, "week", week_start = week_start) - 1) %>%
summarize(across(all_of(agg_columns), agg_fun), .by = all_of(c(keys, "week_start"))) %>%
mutate(time_value = round_date(week_start, "week", week_reference - 1)) %>%
select(-week_start) %>%
as_tibble()
}
) %>%
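The new `agg_fun` in `daily_to_weekly_archive` leans on the identity explained in the comment above: `7 * mean(x, na.rm = TRUE)` equals the sum on a complete week and scales a partial week up to a full-week estimate. A minimal sketch with hypothetical counts (not from the PR):

```r
# Toy illustration of the 7 * mean() weekly aggregation used above.
full_week <- c(10, 12, 9, 11, 13, 8, 7)
partial_week <- c(10, 12, 9, NA, NA, NA, NA) # only 3 of 7 days reported

agg_fun <- \(x) 7 * mean(x, na.rm = TRUE)

agg_fun(full_week) # 70, identical to sum(full_week)
agg_fun(partial_week) # 7 * (31 / 3), the partial sum scaled to a full week
```

Note the `\(x)` lambda shorthand requires R >= 4.1, which this repo already uses elsewhere in the diff.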
17 changes: 0 additions & 17 deletions R/forecasters/data_validation.R
@@ -68,9 +68,6 @@ confirm_sufficient_data <- function(epi_data, ahead, args_input, outcome, extra_
# TODO: Buffer should probably be 2 * n(lags) * n(predictors). But honestly,
# this needs to be fixed in epipredict itself, see
# https://github.com/cmu-delphi/epipredict/issues/106.
if (identical(extra_sources, "")) {
extra_sources <- character(0L)
}
has_no_last_nas <- epi_data %>%
drop_na(c(!!outcome, !!!extra_sources)) %>%
group_by(geo_value) %>%
@@ -106,17 +103,3 @@ filter_minus_one_ahead <- function(epi_data, ahead) {
}
epi_data
}

#' Unwrap an argument if it's a list of length 1
#'
#' Many of our arguments to the forecasters come as lists not because we expect
#' them that way, but as a byproduct of tibble and expand_grid.
unwrap_argument <- function(arg, default_trigger = "", default = character(0L)) {
if (is.list(arg) && length(arg) == 1) {
arg <- arg[[1]]
}
if (identical(arg, default_trigger)) {
return(default)
}
return(arg)
}
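For context on why the removed `unwrap_argument` existed at all, and why the forecasters now call `unlist(extra_sources)` instead: parameter grids built with tibble/expand_grid store non-scalar arguments as list-columns, so a character vector reaches the forecaster wrapped in a length-1 list. A hedged sketch with made-up source names:

```r
library(tidyr)

# Non-scalar arguments must be list-columns in a parameter grid.
params <- expand_grid(
  ahead = c(7, 14),
  extra_sources = list(c("nssp", "nwss")) # hypothetical source names
)

params$extra_sources[[1]] # still wrapped per row: c("nssp", "nwss")
unlist(params$extra_sources[1]) # flattened back to a plain character vector
```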
2 changes: 1 addition & 1 deletion R/forecasters/ensemble_average.R
@@ -25,7 +25,7 @@
ensemble_average <- function(epi_data,
forecasts,
outcome,
extra_sources = "",
extra_sources = character(),
ensemble_args = list(),
ensemble_args_names = NULL) {
# unique parameters must be buried in ensemble_args so that the generic function signature is stable
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_climatological.R
@@ -2,7 +2,7 @@
#'
climate_linear_ensembled <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 7,
trainer = parsnip::linear_reg(),
quantile_levels = covidhub_probs(),
@@ -22,8 +22,7 @@ climate_linear_ensembled <- function(epi_data,
nonlin_method <- arg_match(nonlin_method)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

args_list <- list(...)
ahead <- as.integer(ahead / 7)
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_flatline.R
@@ -10,16 +10,15 @@
#' @export
flatline_fc <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 1,
trainer = parsnip::linear_reg(),
quantile_levels = covidhub_probs(),
filter_source = "",
filter_agg_level = "",
...) {
epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

# perform any preprocessing not supported by epipredict
epi_data %<>% filter_extraneous(filter_source, filter_agg_level)
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_flusion.R
@@ -1,6 +1,6 @@
flusion <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 7,
pop_scaling = FALSE,
trainer = rand_forest(
@@ -24,8 +24,7 @@ flusion <- function(epi_data,
derivative_estimator <- arg_match(derivative_estimator)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

# perform any preprocessing not supported by epipredict
args_input <- list(...)
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_no_recent_outcome.R
@@ -2,7 +2,7 @@
#' it may whiten any old data as the outcome
no_recent_outcome <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 7,
pop_scaling = FALSE,
trainer = epipredict::quantile_reg(),
@@ -24,8 +24,7 @@ no_recent_outcome <- function(epi_data,
week_method <- arg_match(week_method)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

# this is for the case where there are multiple sources in the same column
epi_data %<>% filter_extraneous(filter_source, filter_agg_level)
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_scaled_pop.R
@@ -47,7 +47,7 @@
#' @export
scaled_pop <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 1,
pop_scaling = TRUE,
drop_non_seasons = FALSE,
@@ -64,8 +64,7 @@ scaled_pop <- function(epi_data,
nonlin_method <- arg_match(nonlin_method)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

# perform any preprocessing not supported by epipredict
#
10 changes: 3 additions & 7 deletions R/forecasters/forecaster_scaled_pop_seasonal.R
@@ -38,7 +38,7 @@
scaled_pop_seasonal <- function(
epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 1,
pop_scaling = TRUE,
drop_non_seasons = FALSE,
@@ -61,13 +61,9 @@ scaled_pop_seasonal <- function(
nonlin_method <- arg_match(nonlin_method)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

if (typeof(seasonal_method) == "list") {
seasonal_method <- seasonal_method[[1]]
}
if (all(seasonal_method == c("none", "flu", "covid", "indicator", "window", "climatological"))) {
if (identical(seasonal_method, c("none", "flu", "covid", "indicator", "window", "climatological"))) {
seasonal_method <- "none"
}
# perform any preprocessing not supported by epipredict
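The switch in `scaled_pop_seasonal` from `all(seasonal_method == c(...))` to `identical(...)` avoids recycling surprises when detecting an untouched default argument. A small sketch with toy values:

```r
# Default vector as in scaled_pop_seasonal's signature.
default <- c("none", "flu", "covid", "indicator", "window", "climatological")

# identical() is TRUE only for the full, untouched default vector...
identical(default, default) # TRUE
# ...while a user-chosen value is never mistaken for the default.
identical("window", default) # FALSE

# The old all(x == y) comparison recycles the shorter vector, which can
# silently compare elements you did not intend to pair up.
```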
5 changes: 2 additions & 3 deletions R/forecasters/forecaster_smoothed_scaled.R
@@ -51,7 +51,7 @@
#' @export
smoothed_scaled <- function(epi_data,
outcome,
extra_sources = "",
extra_sources = character(),
ahead = 1,
pop_scaling = TRUE,
trainer = parsnip::linear_reg(),
@@ -73,8 +73,7 @@ smoothed_scaled <- function(epi_data,
nonlin_method <- arg_match(nonlin_method)

epi_data <- validate_epi_data(epi_data)
extra_sources <- unwrap_argument(extra_sources)
trainer <- unwrap_argument(trainer)
extra_sources <- unlist(extra_sources)

# perform any preprocessing not supported by epipredict
#
Expand Down
25 changes: 14 additions & 11 deletions R/forecasters/formatters.R
@@ -72,24 +72,27 @@ format_flusight <- function(pred, disease = c("flu", "covid")) {
}

format_scoring_utils <- function(forecasts_and_ensembles, disease = c("flu", "covid")) {
forecasts_and_ensembles %>%
filter(!grepl("region.*", geo_value)) %>%
mutate(
reference_date = get_forecast_reference_date(forecast_date),
target = glue::glue("wk inc {disease} hosp"),
horizon = as.integer(floor((target_end_date - reference_date) / 7)),
output_type = "quantile",
output_type_id = quantile,
value = value
) %>%
# dplyr here was unreasonably slow on 1m+ rows, so replacing with direct access
fc_ens <- forecasts_and_ensembles
fc_ens <- fc_ens[!grepl("region.*", forecasts_and_ensembles$geo_value), ]
fc_ens[, "reference_date"] <- get_forecast_reference_date(fc_ens$forecast_date)
fc_ens[, "target"] <- glue::glue("wk inc {disease} hosp")
fc_ens[, "horizon"] <- as.integer(floor((fc_ens$target_end_date - fc_ens$reference_date) / 7))
fc_ens[, "output_type"] <- "quantile"
fc_ens[, "output_type_id"] <- fc_ens$quantile
fc_ens %>%
left_join(
get_population_data() %>%
select(state_id, state_code),
by = c("geo_value" = "state_id")
) %>%
rename(location = state_code, model_id = forecaster) %>%
select(reference_date, target, horizon, target_end_date, location, output_type, output_type_id, value, model_id) %>%
drop_na()
drop_na() %>%
arrange(location, target_end_date, reference_date, output_type_id) %>%
group_by(model_id, location, target_end_date, reference_date) %>%
mutate(value = sort(value)) %>%
ungroup()
}

#' The quantile levels used by the covidhub repository
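The new `arrange`/`group_by`/`mutate(value = sort(value))` tail in `format_scoring_utils` enforces monotone quantiles within each forecast group. A toy base-R sketch (not from the PR) of the quantile-crossing problem it fixes:

```r
# One model/location/date group with crossed quantiles: the median sits
# below the 25th percentile, which scoring tools reject.
fc <- data.frame(
  output_type_id = c(0.25, 0.5, 0.75),
  value = c(12, 10, 15)
)
fc <- fc[order(fc$output_type_id), ]
fc$value <- sort(fc$value) # sorting within the group restores monotonicity
fc$value # 10 12 15, non-decreasing in the quantile level
```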
3 changes: 3 additions & 0 deletions R/imports.R
@@ -9,13 +9,15 @@ library(crew)
library(data.table)
library(dplyr)
library(DT)
options(DT.options = list(scrollX = TRUE))
library(epidatr)
library(epipredict)
library(epiprocess)
library(ggplot2)
library(glue)
library(grf)
library(here)
library(httpgd)
if (Sys.getenv("COVID_SUBMISSION_DIRECTORY", "cache") != "cache") {
library(hubValidations)
}
@@ -36,6 +38,7 @@ library(recipes)
library(renv)
library(rlang)
library(rspm)
library(scales)
library(scoringutils)
library(slider)
library(stringr)