Ds/season summary #197

Open
wants to merge 49 commits into
base: main
49 commits
55ddeee
initial decreasing forecasters rmd
dsweber2 Apr 10, 2025
3e57aec
geo_pooled
dsweber2 Apr 10, 2025
f3bb95d
the problem is quantile regression
dsweber2 Apr 11, 2025
9e126b9
double population scaling, fixing made it worse
dsweber2 Apr 11, 2025
8afc79f
linear parenthetical, boost trees are fine
dsweber2 Apr 11, 2025
2e52758
moving things around, coefficient inspection
dsweber2 Apr 11, 2025
9cc9f39
format and add as_of plots to fanplots
dshemetov Apr 11, 2025
78387df
exploring the fit data in detail
dsweber2 Apr 11, 2025
05e4dc5
also doing this for covid
dsweber2 Apr 11, 2025
19960bc
reorg
dshemetov Apr 12, 2025
4c85eff
tweaks
dshemetov Apr 12, 2025
e6f2b07
yet more tweaks
dsweber2 Apr 14, 2025
adc5bc3
clearer read-through for others
dsweber2 Apr 14, 2025
2057e79
tests borked b/c curl?
dsweber2 Apr 14, 2025
6d767b7
filter pre 2022, test dependencies
dsweber2 Apr 14, 2025
a2dbebf
enh: add forecasting on diffs (ARI rather than AR)
dshemetov Apr 14, 2025
8b04c04
doc: add some comments
dshemetov Apr 14, 2025
54850b7
growth_rate filtering
dsweber2 Apr 14, 2025
24d03b1
enh: try seasonal windowing on diffs forecaster
dshemetov Apr 14, 2025
7307734
enh: do diffs forecast on flusion data
dshemetov Apr 15, 2025
7688a20
minor fixes
dsweber2 Apr 15, 2025
5c40e73
fix+enh: minor fixes and add 0 intercept to slope calculation
dshemetov Apr 16, 2025
572939e
wip: start season summary
dshemetov Apr 16, 2025
e542fc8
Getting backtest_mode working
dsweber2 Apr 16, 2025
0f4a119
revision summary notebook
dsweber2 Apr 16, 2025
ddab61c
Basic revision summary complete
dsweber2 Apr 18, 2025
3c6de3b
scores mix all forecasters, first covid day problems
dsweber2 Apr 22, 2025
b06f23a
phase definitions and scores
dsweber2 Apr 22, 2025
4967de7
fix: covid generation dates
dshemetov Apr 23, 2025
ddd1fdb
external scores updating, score only shared dates
dsweber2 Apr 25, 2025
80ab82e
hotfix: april 9 data tweaks
dshemetov Apr 15, 2025
f79c22b
doc: make run.R more correct about env vars
dshemetov Apr 15, 2025
69fd4cc
toc, various minor notes
dsweber2 Apr 28, 2025
0c11070
Merge branch 'main' into ds/season-summary
dshemetov Apr 28, 2025
cd70e80
including forecasts, more text
dsweber2 Apr 28, 2025
110c2f5
include first_day_wrong, covid forecasts
dsweber2 Apr 28, 2025
49f8921
order via factor, covid fcsts, ggplotly
dsweber2 Apr 29, 2025
c2ef046
doc: season summary lint and covid updates
dshemetov Apr 29, 2025
48c3b69
doc: first day wrong lints
dshemetov Apr 29, 2025
def1448
doc: big update to template.md, describe our forecaster families
dshemetov Apr 29, 2025
5b8da7c
doc: add some styling to template.md
dshemetov Apr 29, 2025
2455449
doc: minor template lint
dshemetov Apr 29, 2025
bbd73dd
doc: more styling
dshemetov Apr 29, 2025
6eef6a1
doc: even more
dshemetov Apr 29, 2025
71beaf0
doc: lint revision summary
dshemetov Apr 29, 2025
02668e1
latest forecast
dsweber2 Apr 29, 2025
f6822d0
latest fcst needs -1 ahead not present
dsweber2 Apr 30, 2025
ffd1c9a
adding latest to flu
dsweber2 Apr 30, 2025
93d9188
`latest` results, takeaways
dsweber2 Apr 30, 2025
10 changes: 10 additions & 0 deletions R/forecasters/data_validation.R
@@ -97,6 +97,16 @@ filter_extraneous <- function(epi_data, filter_source, filter_agg_level) {
return(epi_data)
}

#' the minus one ahead causes problems for `quantile_regression` if that data is
#' actually present, so we should filter it out
filter_minus_one_ahead <- function(epi_data, ahead) {
if (ahead < 0) {
dont_include <- attr(epi_data, "metadata")$as_of + ahead
epi_data %<>% filter(time_value < dont_include)
}
epi_data
}
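A minimal sketch of what the new `filter_minus_one_ahead` helper does, written in base R so it runs without `dplyr` or `epiprocess`. The toy data frame, the function name `filter_minus_one_ahead_sketch`, and the reading of `ahead` as a number of days are assumptions made for illustration; the real input is an `epi_df` whose `metadata` attribute carries `as_of`.

```r
# Hypothetical stand-in for an epi_df: a plain data frame with an `as_of`
# entry stored in a "metadata" attribute.
toy <- data.frame(
  geo_value = "ca",
  time_value = as.Date("2025-01-01") + 7 * 0:3,
  value = c(10, 12, 15, 11)
)
attr(toy, "metadata") <- list(as_of = as.Date("2025-01-22"))

# Base R version of the helper (assumes `ahead` is measured in days).
filter_minus_one_ahead_sketch <- function(epi_data, ahead) {
  if (ahead < 0) {
    # Everything on or after as_of + ahead overlaps the negative-ahead target.
    cutoff <- attr(epi_data, "metadata")$as_of + ahead
    epi_data <- epi_data[epi_data$time_value < cutoff, , drop = FALSE]
  }
  epi_data
}

trimmed <- filter_minus_one_ahead_sketch(toy, ahead = -7)
untouched <- filter_minus_one_ahead_sketch(toy, ahead = 7)
```

With a negative `ahead`, rows at or after the cutoff are dropped; with a non-negative `ahead`, the data passes through unchanged.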

#' Unwrap an argument if it's a list of length 1
#'
#' Many of our arguments to the forecasters come as lists not because we expect
2 changes: 2 additions & 0 deletions R/forecasters/forecaster_scaled_pop_seasonal.R
@@ -76,6 +76,8 @@ scaled_pop_seasonal <- function(
epi_data %<>% filter_extraneous(filter_source, filter_agg_level)
# this is a temp fix until a real fix gets put into epipredict
epi_data <- clear_lastminute_nas(epi_data, cols = c(outcome, extra_sources))
# predicting the -1 ahead when it is present sometimes leads to freezing
epi_data %<>% filter_minus_one_ahead(ahead)
# this next part is basically unavoidable boilerplate you'll want to copy
args_input <- list(...)
# edge case where there is no data or less data than the lags; eventually epipredict will handle this
25 changes: 14 additions & 11 deletions R/forecasters/formatters.R
@@ -72,24 +72,27 @@ format_flusight <- function(pred, disease = c("flu", "covid")) {
}

format_scoring_utils <- function(forecasts_and_ensembles, disease = c("flu", "covid")) {
forecasts_and_ensembles %>%
filter(!grepl("region.*", geo_value)) %>%
mutate(
reference_date = get_forecast_reference_date(forecast_date),
target = glue::glue("wk inc {disease} hosp"),
horizon = as.integer(floor((target_end_date - reference_date) / 7)),
output_type = "quantile",
output_type_id = quantile,
value = value
) %>%
# dplyr here was unreasonably slow on 1m+ rows, so replacing with direct access
fc_ens <- forecasts_and_ensembles
fc_ens <- fc_ens[!grepl("region.*", forecasts_and_ensembles$geo_value), ]
fc_ens[, "reference_date"] <- get_forecast_reference_date(fc_ens$forecast_date)
fc_ens[, "target"] <- glue::glue("wk inc {disease} hosp")
fc_ens[, "horizon"] <- as.integer(floor((fc_ens$target_end_date - fc_ens$reference_date) / 7))
fc_ens[, "output_type"] <- "quantile"
fc_ens[, "output_type_id"] <- fc_ens$quantile
fc_ens %>%
left_join(
get_population_data() %>%
select(state_id, state_code),
by = c("geo_value" = "state_id")
) %>%
rename(location = state_code, model_id = forecaster) %>%
select(reference_date, target, horizon, target_end_date, location, output_type, output_type_id, value, model_id) %>%
drop_na()
drop_na() %>%
arrange(location, target_end_date, reference_date, output_type_id) %>%
group_by(model_id, location, target_end_date, reference_date) %>%
mutate(value = sort(value)) %>%
ungroup()
}
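The trailing `arrange()` / `group_by()` / `mutate(value = sort(value))` steps appear intended to make the values non-decreasing across quantile levels within each forecast. A base R sketch of that idea follows; the toy data and the reduced grouping (only `model_id` and `location`) are assumptions for illustration, not the function's actual input.

```r
# Hypothetical forecasts: model "m1" has crossed quantiles (12 > 11).
toy <- data.frame(
  model_id = rep(c("m1", "m2"), each = 3),
  location = "06",
  output_type_id = rep(c(0.25, 0.5, 0.75), times = 2),
  value = c(12, 11, 15, 3, 4, 5)
)

# Order by quantile level within each group, then sort the values within
# each group so they increase along with the quantile level.
toy <- toy[order(toy$model_id, toy$location, toy$output_type_id), ]
toy$value <- ave(toy$value, toy$model_id, toy$location, FUN = sort)
```

Sorting only reorders the values a forecaster already produced, so it repairs crossings without inventing new numbers.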

#' The quantile levels used by the covidhub repository
7 changes: 3 additions & 4 deletions R/targets/score_targets.R
@@ -13,7 +13,7 @@ get_external_forecasts <- function(external_object_name) {
select(forecaster, geo_value, forecast_date, target_end_date, quantile, value)
}

score_forecasts <- function(nhsn_latest_data, joined_forecasts_and_ensembles) {
score_forecasts <- function(nhsn_latest_data, joined_forecasts_and_ensembles, disease) {
truth_data <-
nhsn_latest_data %>%
select(geo_value, target_end_date = time_value, oracle_value = value) %>%
@@ -33,9 +33,8 @@ score_forecasts <- function(nhsn_latest_data, joined_forecasts_and_ensembles) {
pull(max_forecast) %>%
min()
forecasts_formatted <-
joined_forecasts_and_ensembles %>%
filter(forecast_date <= max_forecast_date) %>%
format_scoring_utils(disease = "covid")
joined_forecasts_and_ensembles[joined_forecasts_and_ensembles$forecast_date <= max_forecast_date,] %>%
format_scoring_utils(disease = disease)
scores <- forecasts_formatted %>%
filter(location %nin% c("US", "60", "66", "78")) %>%
hubEvals::score_model_out(
79 changes: 48 additions & 31 deletions R/utils.R
@@ -392,39 +392,9 @@ update_site <- function(sync_to_s3 = TRUE) {
)

# Insert into Production Reports section, skipping a line
prod_reports_index <- which(grepl("## Production Reports", report_md_content)) + 1
prod_reports_index <- which(grepl("## Weekly Fanplots 2024-2025 Season", report_md_content)) + 1
report_md_content <- append(report_md_content, report_link, after = prod_reports_index)
}
# add scoring notebooks if they exist
score_files <- dir_ls(reports_dir, regexp = ".*_backtesting_2024_2025_on_.*.html")
if (length(score_files) > 0) {
# a tibble of all score files, along with their generation date and disease
score_table <- tibble(
filename = score_files,
dates = str_match_all(filename, "[0-9]{4}-..-..")
) %>%
unnest_wider(dates, names_sep = "_") %>%
rename(generation_date = dates_1) %>%
mutate(
generation_date = ymd(generation_date),
disease = str_match(filename, "flu|covid")
)
used_files <- score_table %>%
group_by(disease) %>%
slice_max(generation_date)
# iterating over the diseases
for (row_num in seq_along(used_files$filename)) {
file_name <- path_file(used_files$filename[[row_num]])
scoring_index <- which(grepl("### Scoring this season", report_md_content)) + 1
score_link <- sprintf(
"- [%s Scoring, Rendered %s](%s)",
str_to_title(used_files$disease[[row_num]]),
used_files$generation_date[[row_num]],
file_name
)
report_md_content <- append(report_md_content, score_link, after = scoring_index)
}
}

# Write the updated content to report.md
report_md_path <- path(reports_dir, "report.md")
@@ -725,3 +695,50 @@ get_socrata_updated_at <- function(dataset_url, missing_value = MAX_TIMESTAMP) {
}
)
}



#' get the unique shared (geo_value, forecast_date, target_end_date) tuples present for each forecaster in `forecasts`
get_unique <- function(forecasts) {
forecasters <- forecasts %>%
pull(forecaster) %>%
unique()
distinct <- map(
forecasters,
\(x) forecasts %>%
filter(forecaster == x) %>%
select(geo_value, forecast_date, target_end_date) %>%
distinct()
)
distinct_dates <- reduce(distinct, \(x, y) x %>% inner_join(y, by = c("geo_value", "forecast_date", "target_end_date")))
mutate(
distinct_dates,
forecast_date = round_date(forecast_date, unit = "week", week_start = 6)
)
}

#' Filter the external and local forecasts to just the shared dates/geos.
#' Some forecasters only cover a limited set of geos; we want to include those
#' anyway, so they are listed in `trucated_forecasters` and left out when
#' computing the shared set. The external forecasts may also contain previous
#' seasons' forecasts, which we exclude via `season_start`.
filter_shared_geo_dates <- function(local_forecasts, external_forecasts, season_start = "2024-11-01", trucated_forecasters = "windowed_seasonal_extra_sources") {
viable_dates <- inner_join(
local_forecasts %>%
filter(forecaster %nin% trucated_forecasters) %>%
get_unique(),
external_forecasts %>%
filter(forecast_date > season_start) %>%
get_unique(),
by = c("geo_value", "forecast_date", "target_end_date")
)
dplyr::bind_rows(
local_forecasts %>%
mutate(
forecast_date = round_date(forecast_date, unit = "week", week_start = 6)
) %>%
inner_join(viable_dates, by = c("geo_value", "forecast_date", "target_end_date")),
external_forecasts %>%
inner_join(viable_dates, by = c("geo_value", "forecast_date", "target_end_date"))
)
}
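The core of `get_unique` is an intersection of (geo_value, forecast_date, target_end_date) tuples across forecasters, built by reducing over inner joins. A base R sketch of that pattern, using `merge()` in place of `inner_join()`; the toy forecasts are an assumption for illustration, and the weekly `round_date()` step is omitted.

```r
# Hypothetical forecasts: "a" and "b" cover two geos, "c" covers only "ca".
forecasts <- data.frame(
  forecaster = c("a", "a", "b", "b", "c"),
  geo_value = c("ca", "ny", "ca", "ny", "ca"),
  forecast_date = as.Date("2025-01-04"),
  target_end_date = as.Date("2025-01-11")
)
keys <- c("geo_value", "forecast_date", "target_end_date")

# One de-duplicated key table per forecaster.
per_forecaster <- lapply(split(forecasts[keys], forecasts$forecaster), unique)

# merge() on all key columns acts as an inner join, so reducing over the
# list leaves only the tuples that every forecaster has.
shared <- Reduce(function(x, y) merge(x, y, by = keys), per_forecaster)
```

Because a single sparse forecaster shrinks the intersection for everyone, `filter_shared_geo_dates` excludes the forecasters named in `trucated_forecasters` before computing it.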
16 changes: 8 additions & 8 deletions covid_geo_exclusions.csv
@@ -141,14 +141,14 @@ forecast_date,forecaster,geo_value,weight
##################
# feb 12
##################
2025-02-05, all, mp, 0
2025-02-05, windowed_seasonal, all, 3
2025-02-05, windowed_seasonal_extra_sources, all, 0.0
2025-02-05, climate_linear, all, 3
2025-02-05, linear, all, 0.5
2025-02-05, linearlog, all, 0
2025-02-05, climate_base, all, 0
2025-02-05, climate_geo_agged, all, 0.0
2025-02-12, all, mp, 0
2025-02-12, windowed_seasonal, all, 3
2025-02-12, windowed_seasonal_extra_sources, all, 0.0
2025-02-12, climate_linear, all, 3
2025-02-12, linear, all, 0.5
2025-02-12, linearlog, all, 0
2025-02-12, climate_base, all, 0
2025-02-12, climate_geo_agged, all, 0.0
##################
# feb 5
##################