detect_outlr_stl() requires gap filling while detect_outlr_rm() does not? Document this. #253

Open
rachlobay opened this issue Dec 2, 2022 · 1 comment
Labels
documentation Improvements or additions to documentation P3 very low priority

Comments

@rachlobay
Collaborator

rachlobay commented Dec 2, 2022

It seems that detect_outlr_stl() requires gap filling. For example, if we manually remove two rows from the epi_df below and pass it to detect_outlr_stl(), we get an error about the data containing implicit gaps in time. In contrast, if we use the same data in detect_outlr_rm(), that function raises no complaints (likely due to epi_slide). So we should have clear documentation explaining how each function handles missing rows. We should also probably update the epi_slide vignette to explain how it handles missing rows of data, which users of this package are likely to encounter.

Example showing the error in detect_outlr_stl() and no problem in detect_outlr_rm():

library(epidatr)
library(epiprocess)
library(dplyr)
library(tidyr)

# Load #s of new confirmed COVID-19 cases, daily, for FL
# over a fairly large time window
x <- covidcast(
  data_source = "jhu-csse",
  signals = "confirmed_incidence_num",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(20200601, 20220601),
  geo_values = "fl",
  as_of = 20220606
) %>%
  fetch_tbl() %>%
  select(geo_value, time_value, cases = value) %>%
  as_epi_df()

x <- x[-c(2, 10), ] # Remove some rows from x

y = x$cases
x = x$time_value
# The below should all be the default values from detect_outlr_stl()
n_trend = 21
n_seasonal = 21
n_threshold = 21
seasonal_period = NULL
log_transform = FALSE
detect_negatives = FALSE
detection_multiplier = 2.5
min_radius = 0
replacement_multiplier = 0

# Below is the first part of the detect_outlr_stl() function 
  # Transform if requested
  if (log_transform) {
    # Replace all negative values with 0
    y = pmax(0, y)
    offset = as.integer(any(y == 0))
    y = log(y + offset)
  }
  
  # Make a tsibble for fabletools, setup and run STL
  z_tsibble = tsibble::tsibble(x = x, y = y, index = x)
  
  stl_formula = y ~ trend(window = n_trend) +
    season(period = seasonal_period, window = n_seasonal)
  
  stl_components = z_tsibble %>%
    fabletools::model(feasts::STL(stl_formula, robust = TRUE)) %>%
    generics::components() %>%
    tibble::as_tibble() %>%
    dplyr::select(trend:remainder) %>%
    dplyr::rename_with(~ "seasonal", tidyselect::starts_with("season")) %>% 
    dplyr::rename(resid = remainder)


# Now, the same data when inputted into detect_outlr_rm() has no apparent problem

  x <- covidcast(
    data_source = "jhu-csse",
    signals = "confirmed_incidence_num",
    time_type = "day",
    geo_type = "state",
    time_values = epirange(20200601, 20220601),
    geo_values = "fl",
    as_of = 20220606
  ) %>%
    fetch_tbl() %>%
    select(geo_value, time_value, cases = value) %>%
    as_epi_df()
  
  x <- x[-c(2, 10), ] # Remove some rows from x
  
  x <- x %>%
    group_by(geo_value) %>%
    # detection_multiplier is passed to detect_outlr_rm() itself, not to mutate();
    # a larger value than 2.5 may be worth trying
    mutate(outlier_info = detect_outlr_rm(
      x = time_value, y = cases,
      detection_multiplier = 2.5
    )) %>%
    unnest(outlier_info)
  
  x
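
The implicit-gap situation described above can be reproduced without fetching any data. The following is a minimal sketch, assuming the tsibble package is installed. The toy series and the names `toy`, `toy_ts`, and `filled` are made up for illustration and are not part of epiprocess. It shows how the removed dates register as implicit gaps and how `tsibble::fill_gaps()` turns them into explicit rows with `NA` values.

```r
library(tsibble)

# Hypothetical toy daily series with rows 2 and 10 removed,
# mimicking the row removal in the example above
dates <- seq(as.Date("2020-06-01"), by = "day", length.out = 60)
toy <- tibble::tibble(time_value = dates, cases = seq_along(dates))[-c(2, 10), ]

# A tsibble can be built from data with missing dates;
# the missing dates are then implicit gaps
toy_ts <- as_tsibble(toy, index = time_value)
has_gaps(toy_ts)

# Make the gaps explicit: the missing dates return as rows with cases = NA
filled <- fill_gaps(toy_ts)
has_gaps(filled)
```

Filling gaps this way only restores the missing rows. Whether the resulting `NA` values are acceptable depends on the downstream method, so an imputation step may still be needed before running an STL decomposition.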
@rachlobay rachlobay added the documentation Improvements or additions to documentation label Dec 2, 2022
@rachlobay rachlobay changed the title detect_outlr_stl() requires gap filling while detect_outlr_rm() does not? Document this. detect_outlr_stl() requires gap filling while detect_outlr_rm() does not? Document this. Dec 2, 2022
@brookslogan brookslogan added the P3 very low priority label Dec 5, 2022
@brookslogan
Contributor

I'm marking this as P3 because it might be sort of working as intended/stated as is, and because we probably want to prioritize #256, which would change whether/how we approach this one. It'd be nice if all detect_outlr* and epi_cor methods could have the same approach here, so we don't need to describe these details for every method.
