flatline-forecaster.qmd

# Introducing the flatline forecaster

The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The prediction intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).

## Example of using the flatline forecaster

```{r}
#| echo: false
#| message: false
#| warning: false
source("_common.R")
```


We will continue to use the `case_death_rate_subset` dataset that comes with the
`epipredict` package. In brief, this is a subset of the JHU daily COVID-19 cases
and deaths by state. While this dataset ranges from Dec 31, 2020 to Dec 31, 
2021, we will only consider a small subset at the end of that range to keep our
example relatively simple.

```{r}
jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value >= as.Date("2021-09-01"))

jhu
```

### The basic mechanics of the flatline forecaster

Suppose that our goal is to predict death rates one week ahead of the last date available for each state. Mathematically, on day $t$, we want to predict new deaths $y$ that are $h$ days ahead at many locations $j$. So for each location, we'll predict
$$
\hat{y}_{j, {t+h}} = y_{j, t}
$$
where $t$ is 2021-12-31, $h$ is 7 days, and $j$ is the state in our example.

Now, the simplest way to create and train a flatline forecaster to predict the death rate one week into the future is to input the `epi_df` and the name of the column from it that we want to predict in the `flatline_forecaster` function. 

```{r}
one_week_ahead <- flatline_forecaster(jhu, 
                                      outcome = "death_rate")
one_week_ahead
```

The result is a S3 object, which contains three objects - metadata, a tibble of predictions, 
and a S3 object of class `epi_workflow`. There are a few important things to note here.
First, the fitted model object contained in the `epi_workflow`
object could be used any time in the future to create different forecasts. 
Next, the tibble of predicted values and prediction intervals is for each location 7 days 
after the last available time value in the data, which is Dec 31, 2021. Note that 7 days is the default
number of time steps ahead of the forecast date in which forecasts should be
produced. To change this, you must change the value of the `ahead` parameter
in the list of additional arguments `flatline_args_list()`. Let's change this
to 5 days to get some practice.

```{r}
five_days_ahead <- flatline_forecaster(
  jhu, 
  outcome = "death_rate",
  flatline_args_list(ahead = 5L)
)

five_days_ahead
```

We could also specify that we want a 80% prediction interval by changing the 
levels. The default 0.05 and 0.95 levels/quantiles give us 90% prediction 
interval.

```{r}
five_days_ahead <- flatline_forecaster(
  jhu, 
  outcome = "death_rate",
  flatline_args_list(ahead = 5L, 
                     levels = c(0.1, 0.9))
)

five_days_ahead
```

To see the other arguments that you may modify, please see `?flatline_args_list()`. For now, we will move on to looking at the workflow.

```{r}
five_days_ahead$epi_workflow
```

The fitted model here was based on minimal pre-processing of the data, 
estimating a flatline model, and then post-processing the results to be 
meaningful for epidemiological tasks. To look deeper into the pre-processing 
or post-processing parts individually, you can use the `extract_preprocessor()`
or `extract_frosting()` functions on the `epi_workflow`, respectively. 
For example, let’s examine the pre-processing part in more detail.

```{r}
#| results: false
library(workflows)
extract_preprocessor(five_days_ahead$epi_workflow)
```

```{r}
#| echo: false
#| results: asis
#| message: true
#| collapse: true
extract_preprocessor(five_days_ahead$epi_workflow)
```

Under Operations, we can see that the pre-processing operations were to lead the
death rate by 5 days (`step_epi_ahead()`) and that the \# of recent observations
used in the training window were not limited (in `step_training_window()` as
`n_training = Inf` in `flatline_args_list()`).

For symmetry, let's have a look at the post-processing.

```{r}
#| results: false
extract_frosting(five_days_ahead$epi_workflow)
```

```{r}
#| echo: false
#| collapse: true
#| results: false
#| message: true
extract_frosting(five_days_ahead$epi_workflow)
```

The post-processing operations in the order the that were performed were to create the predictions and the prediction intervals, add the forecast and target dates and bound the predictions at zero.

We can also easily examine the predictions themselves.

```{r}
five_days_ahead$predictions
```

The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 80% prediction interval is available for every state (`geo_value`).

The figure below displays the prediction and prediction interval for three sample states: Arizona, Florida, and New York.

```{r}
#| fig-height: 5
#| code-fold: true
samp_geos <- c("az", "ny", "fl")

hist <- jhu %>% 
  filter(geo_value %in% samp_geos)

preds <- five_days_ahead$predictions %>% 
  filter(geo_value %in% samp_geos) %>% 
  mutate(q = nested_quantiles(.pred_distn)) %>% 
  unnest(q) %>%
  pivot_wider(names_from = tau, values_from = q)

ggplot(hist, aes(color = geo_value)) +
  geom_line(aes(time_value, death_rate)) +
  theme_bw() +
  geom_errorbar(data = preds, aes(x = target_date, ymin = `0.1`, ymax = `0.9`)) +
  geom_point(data = preds, aes(target_date, .pred)) +
  geom_vline(data = preds, aes(xintercept = forecast_date)) +
  scale_colour_viridis_d(name = "") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  facet_grid(geo_value ~ ., scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = "", y = "Incident deaths per 100K\n inhabitants") 
```

The vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty of what's to come, the flatline forecaster may be best relegated to being a baseline model and nothing more.

Take for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 1 to 28 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. And then we row bind the list of results.

```{r}
out_df <- map(1:28, ~ flatline_forecaster(
  epi_data = jhu, 
  outcome = "death_rate",
  args_list = flatline_args_list(ahead = .x))$predictions) %>% 
  list_rbind()
```

Then we proceed as we did before. The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`.

```{r}
#| fig-height: 5
#| code-fold: true
preds <- out_df %>% 
  filter(geo_value %in% samp_geos) %>% 
  mutate(q = nested_quantiles(.pred_distn)) %>% 
  unnest(q) %>%
  pivot_wider(names_from = tau, values_from = q)

ggplot(hist) +
  geom_line(aes(time_value, death_rate)) +
  geom_ribbon(
    data = preds, 
    aes(x = target_date, ymin = `0.05`, ymax = `0.95`, fill = geo_value)) +
  geom_point(data = preds, aes(target_date, .pred, colour = geo_value)) +
  geom_vline(data = preds, aes(xintercept = forecast_date)) +
  scale_colour_viridis_d() +
  scale_fill_viridis_d(alpha = .4) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(expand = expansion(c(0, .05))) +
  facet_grid(geo_value ~ ., scales = "free_y") +
  labs(x = "", y = "Incident deaths per 100K\n inhabitants") +
  theme(legend.position = "none")
```

Now you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural.

So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. And moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model.

## What we've learned in a nutshell

Though the flatline forecaster is a very basic model with limited customization, it is about as steady and predictable as a model can get. So it provides a good reference or baseline to compare more complicated models to.