This gives a more mathematical description of the various forecasters defined in this repository.
Makes all historical data from ILI+, nhsn, and flusurv+ available, all as rows, with a `source` column to distinguish the origin (for forecasters to filter on) and an `agg_level` column to distinguish states from hhs_regions; nhsn data has also been aggregated to the `hhs_region` level and included.
The “+” in ILI+ and flusurv+ means that each has been adjusted so that its total for the season matches nhsn’s total. Flusurv is taken from epidata, while ILI+ was constructed by Evan and given to Richard.
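As a toy illustration of the combined layout (made-up values; column names beyond `source` and `agg_level` are assumptions):

```r
library(tibble)

# One row per (source, geo, week)
tribble(
  ~source,    ~agg_level,   ~geo_value, ~time_value,  ~value,
  "nhsn",     "state",      "ca",       "2023-10-07", 123,
  "nhsn",     "hhs_region", "hhs_9",    "2023-10-07", 456,
  "ili+",     "state",      "ca",       "2023-10-07", 1.2,
  "flusurv+", "state",      "ca",       "2023-10-07", 2.3
)
```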
Currently looking to train on the 2023 season, so October 2023 through late April 2024.
1, 3, and 4 looked to be the most correlated. I’m opting to use unversioned data, as the only thing that versioning seems to record is full outages, which is better reflected by the equivalent model with the source missing. I’m also including 1+3+4 as a summary version; it may be preferred for its decreased variable count. Also, this has to be aggregated to the weekly level from daily data.
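A minimal sketch of the daily-to-weekly aggregation (column names and the use of a mean rather than a sum are assumptions):

```r
library(dplyr)
library(lubridate)

# Toy daily data; the real input has one row per geo per day
daily <- tibble(
  geo_value = "ca",
  time_value = as.Date("2023-10-01") + 0:13,
  value = rnorm(14, mean = 100)
)

# Collapse to weekly means; epiweeks start on Sunday (week_start = 7)
weekly <- daily %>%
  mutate(week_start = floor_date(time_value, unit = "week", week_start = 7)) %>%
  group_by(geo_value, week_start) %>%
  summarize(value = mean(value, na.rm = TRUE), .groups = "drop")
```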
Straightforward: it comes as weekly. We will be using finalized data because we only started recording this in April 2024, so we don’t have the relevant versions. This is somewhat of a problem, because this signal does have revision behavior; expect the results to be better than they would be in real production.
I don’t have a great name for this dataset; it’s the wastewater data as processed by NWSS from the sample-site level up to the state level, which I expect to be a fairly poor summary (e.g. it likely over-represents the regions that happen to be sampled). It is currently downloaded via a local script from their website. Like NSSP, it likely has real revision behavior that is not being represented.
Note that this is a description of the flusion-like model, rather than the one written up in that paper. This is definitely the most complicated forecaster; it adds many non-AR components. First, on a per-geo, per-group basis, each signal is transformed via:

$$
x' = (x+0.01)^{\frac{1}{4}}
$$

$$
\bar{x} = \frac{x'-\textrm{med}(x')}{\textrm{quant}(x',0.95) - \textrm{quant}(x',0.05)}
$$

This differs from the data whitening used in the flusion paper in two ways: first, it subtracts the median rather than the mean, and second, it scales by the difference between the 5th and 95th quantiles rather than just the 95th quantile. This is more closely in line with the RobustScaler from scikit-learn (using a much wider quantile range than the default settings there). Neither we nor they have tested whether a robust scaler is more appropriate than, say, a mean+std scaler for this case.
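A minimal sketch of this whitening step (the function name is mine, not the repository’s):

```r
# Fourth-root transform, then center on the median and scale by the
# 5th-95th quantile range; applied within each geo/source group
whiten <- function(x) {
  xp <- (x + 0.01)^(1 / 4)
  q <- quantile(xp, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)
  (xp - q[[2]]) / (q[[3]] - q[[1]])
}

whiten(c(0, 5, 20, 80, 150))
```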
The model is roughly (dropping coefficients, and ignoring subtleties in the way different engines handle different variables)

$$
\bar{x}_{t+k} = \textrm{ar}(\bar{x}) + t_{\textrm{season}} + p + d + \chi_{\textrm{geo}} + \chi_{\textrm{source}} + \big\langle \bar{x}_{t-k}\big\rangle_{k=0:1} + \big\langle \bar{x}_{t-k}\big\rangle_{k=0:3} + \textrm{de}(\bar{x})
$$
Where:

- $\textrm{ar}(\bar{x})$ represents a typical and configurable AR process
- $t_{\textrm{season}}$ is the week of the season (starting at 1 for epiweek 40, going to 52 at epiweek 39 of the next year)
- $t_{\textrm{year}} - 52$ gives the time in weeks until the last week of the year (Christmas + New Year’s)
- $p$ is the population
- $d$ is the population density (not in the original flusion model)
- $\chi_{\textrm{geo}}$ is a one-hot indicator of the `geo_value` (so it actually adds one binary variable per geo). This also encodes the `geo_scale` (hhs region vs state).
- $\chi_{\textrm{source}}$ is a one-hot indicator of the `source` of the data. This is related to the scale, but in a weighted rather than direct manner.
- $\big\langle \bar{x}_{t-k}\big\rangle_{k=0:j}$ denotes the mean of the $j+1$ values up to and including the present (so the previous 2 and 4 weeks)
- $\textrm{de}(\bar{x})$ is a collection of variables using some method of estimating the first and second derivatives. The method used in the original paper is a rather ad hoc collection of linear and quadratic regressions. Initially we tried to use this as well via `get_poly_coefs`, but that was quite slow. Instead we are currently using the “rel_change” method of `step_growth_rate`, applied iteratively to get higher-order derivatives. For the first derivative, we use three weeks, so explicitly this is $d_t = \frac{1}{21}\big(\frac{x_t+x_{t-1}}{x_{t-2}+x_{t-1}}-1\big)$. Similarly, we estimate the second derivative using the relative change in this $d_t$ over 2 weeks, which is $\frac{1}{14}\big(\frac{d_t}{d_{t-1}}-1\big)$.
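A minimal sketch of these iterated rel_change derivative estimates, written in plain base R rather than via `step_growth_rate` (the helper name and example values are mine):

```r
# Relative change between the trailing window and the window one step
# earlier, scaled to a per-day rate over dt days
rel_change_deriv <- function(x, window, dt) {
  n <- length(x)
  d <- rep(NA_real_, n)
  for (t in seq(window + 1, n)) {
    cur <- mean(x[(t - window + 1):t])
    prev <- mean(x[(t - window):(t - 1)])
    d[t] <- (cur / prev - 1) / dt
  }
  d
}

x <- c(10, 12, 15, 20, 26, 30, 28, 25)
d1 <- rel_change_deriv(x, window = 2, dt = 21)   # first derivative, over 3 weeks
d2 <- rel_change_deriv(d1, window = 1, dt = 14)  # second derivative, over 2 weeks
```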
At the moment this is only set up to use random forests as implemented by grf, as I believe it’s the only quantile model we have that introduces regularization.
I have a test set up to drop the state dummy variables, the source dummy variables, and the growth rates.
A flusion-adjacent model pared down to handle the case of not having the target as a predictor.
$$
\bar{x}_{t+k} = f(t_{\textrm{season}}) + p + d + \big\langle y_{t-k}\big\rangle_{k=0:1} + \big\langle y_{t-k}\big\rangle_{k=0:3}
$$
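The snippet below plots the two sine terms presumably used to build $f(t_{\textrm{season}})$; note that both are held constant after season week 36: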
```r
library(tidyverse)

# Plot the two sine terms over the 52 weeks of a season; pmin() freezes
# both curves at their week-36 values
res <- tibble(season_week = 1:52)
res %>%
  mutate(
    first_term = sinpi((pmin(season_week, 36) - 1) / 35),
    second_term = sinpi(2 * (pmin(season_week, 36) - 1) / 35)
  ) %>%
  pivot_longer(cols = c("first_term", "second_term")) %>%
  ggplot(aes(x = season_week, y = value, color = name)) +
  geom_line()
```
I am also going to test keeping and dropping the robust scaling and the population.
A simple model, which predicts using

$$
x_{t+k} = \textrm{ar}(x)
$$

where $\textrm{ar}(x)$ is the same sort of configurable AR process as above.
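A minimal sketch of fitting such a model, assuming epipredict’s `arx_forecaster` and its bundled `case_death_rate_subset` example data (the lags and ahead here are placeholders, not the configured values):

```r
library(epipredict)

# Forecast a rate from its own lags with a linear AR model
fc <- arx_forecaster(
  case_death_rate_subset,
  outcome = "death_rate",
  predictors = "death_rate",
  args_list = arx_args_list(lags = c(0, 7, 14), ahead = 7)
)
```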
This is part of the covid forecaster we used last year, which performed approximately as well as any of the others. It takes the form:

$$
x_{t+k} = \textrm{ar}(x) + \sum_{i=i_1,\ldots,i_n} \big\langle x_{t-k}\big\rangle_{k=0:i} + \bigg\langle \big(x_{t-k}-\big\langle x_{t-k}\big\rangle_{k=0:j_0}\big)^2\bigg\rangle_{k=0:j_1}
$$

where $i_1,\ldots,i_n$ are a configurable collection of window widths for the extra rolling means, and the final term is a rolling variance-like summary whose window lengths are set by $j_0$ and $j_1$.
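A minimal sketch of computing these extra features in base R (helper names and example widths are mine; coefficients come from the fitted model, so only the features are shown):

```r
# Trailing mean of the current value and the previous w - 1 values
rolling_mean <- function(x, w) {
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}

# Rolling means over several window widths, plus a rolling mean of
# squared deviations from a j0-window mean (a variance-like summary)
extra_features <- function(x, widths = c(2, 4), j0 = 4, j1 = 4) {
  means <- sapply(widths, function(w) rolling_mean(x, w))
  var_like <- rolling_mean((x - rolling_mean(x, j0))^2, j1)
  cbind(means, var_like)
}

extra_features(c(10, 12, 15, 20, 26, 30, 28, 25))
```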
Like `scaled_pop`, we’re testing this both with and without population scaling, and with and without the extra data.
This is exactly the flatline forecaster from epipredict.
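For reference, a minimal invocation, assuming epipredict’s `flatline_forecaster` and the same example data as above:

```r
library(epipredict)

# Propagate the most recent value forward, with residual-based quantile bands
fc <- flatline_forecaster(
  case_death_rate_subset,
  outcome = "death_rate"
)
```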