Skip to content

Debug apparent case_rate_7d_av discrepancies #251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brookslogan opened this issue Dec 2, 2022 · 6 comments · Fixed by #402
Closed

Debug apparent case_rate_7d_av discrepancies #251

brookslogan opened this issue Dec 2, 2022 · 6 comments · Fixed by #402
Assignees
Labels
bug Something isn't working P0 high priority

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Dec 2, 2022

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
jhu_csse_daily_subset %>%
  # (we don't need a `complete` here as time values are already regularly spaced
  # and the same for every geo value)
  group_by(geo_value) %>%
  # [vvvvv BUGGY, but not the source of the discrepancy]
  epi_slide(cases_7d_av2 = sum(cases)/7, n=7L) %>%
  ungroup() %>%
  mutate(discrepancy = cases_7d_av2 - cases_7d_av) %>%
  # get worst discrepancy + context:
  # vvvv also a little buggy; produces error if it's at the beginning of the first geo_value's time series
  slice(which.max(abs(discrepancy))-(6:0)) %>%
  filter(geo_value == tail(geo_value, 1L)) %>%
  # select only relevant columns
  select(geo_value, time_value, cases, cases_7d_av, cases_7d_av2, discrepancy)
#> An `epi_df` object, 7 x 6 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-23 13:17:07
#> 
#> # A tibble: 7 × 6
#>   geo_value time_value cases cases_7d_av cases_7d_av2 discrepancy
#> * <chr>     <date>     <dbl>       <dbl>        <dbl>       <dbl>
#> 1 ny        2020-11-09  3446       3070         3070           0 
#> 2 ny        2020-11-10     0       2642.        2774.        133.
#> 3 ny        2020-11-11  4820       3593.        3168.       -425.
#> 4 ny        2020-11-12  5225       3939.        3454.       -484.
#> 5 ny        2020-11-13  5666       4289.        3805.       -484.
#> 6 ny        2020-11-14  5577       4529         4045.       -484.
#> 7 ny        2020-11-15  4192       4617.        4132.       -484.
(3446 + 0 + 4820 + 5225 + 5666 + 5577 + 4192)/7
#> [1] 4132.286

Created on 2022-12-01 by the reprex package (v2.0.1)

Also, fix apparent documentation bug:

\href{https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html}{From the COVIDcast Epidata API}: `case_rate_7d_av` is taken directly from the JHU CSSE \href{https://github.com/CSSEGISandData/COVID-19}{COVID-19 GitHub repository} without changes. The 7-day average signals are computed by Delphi by calculating moving averages of the preceding 7 days, so the signal for June 7 is the average of the underlying data for June 1 through 7, inclusive.

We want to say that the non-7dav signals are directly from JHU CSSE, while Delphi calculated the moving averages.

@brookslogan brookslogan added bug Something isn't working P0 high priority labels Dec 2, 2022
@brookslogan
Copy link
Contributor Author

brookslogan commented Dec 2, 2022

Regenerating [the data object would resolve] this discrepancy, although using the epi_slide above would give the impression that some remained, because partial data windows were not handled appropriately!

@dshemetov dshemetov self-assigned this Oct 24, 2023
@dshemetov dshemetov added this to the Epiprocess Issue Triage milestone Oct 24, 2023
@brookslogan
Copy link
Contributor Author

  • regenerate data object
  • fix doc bug to note that Delphi did 7dav-ing

@dshemetov
Copy link
Contributor

Working on regenerating this object, I looked to the API. It turns out we haven't fixed our smoothing discrepancies there either. Here's some code that demonstrates that:

df1_cases <- epidatr::pub_covidcast(
  "jhu-csse",
  "confirmed_incidence_num",
  "state",
  "day",
  "*",
  epidatr::epirange("2020-03-01", "2021-12-31")
)
df2_cases <- epidatr::pub_covidcast(
  "jhu-csse",
  "confirmed_7dav_incidence_num",
  "state",
  "day",
  "*",
  epidatr::epirange("2020-03-01", "2021-12-31")
)
df <- inner_join(df1_cases, df2_cases %>% select(geo_value, time_value, value), by = c("geo_value", "time_value"))

df %>%
  as_epi_df() %>%
  group_by(geo_value) %>%
  epi_slide(cases_7d_av2 = sum(value.x) / 7, before = 7L) %>%
  ungroup() %>%
  mutate(discrepancy = cases_7d_av2 - value.y) %>%
  {
    quantile <- quantile(.$discrepancy, 0.9)
    filter(., .$discrepancy > quantile)
  } %>%
  select(geo_value, time_value, cases_7d_av2, discrepancy) %>%
  filter(geo_value == "ny")

### Output
An `epi_df` object, 323 x 4 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2023-03-10

# A tibble: 323 × 4
   geo_value time_value cases_7d_av2 discrepancy
 * <chr>     <date>            <dbl>       <dbl>
 1 ny        2020-03-29        7034         683 
 2 ny        2020-03-30        7353.        813.
 3 ny        2020-03-31        7853.        685.
 4 ny        2020-04-01        9242.        737.
 5 ny        2020-04-02        9991.        937.
 6 ny        2020-04-03       10580.       1043.
 7 ny        2020-04-04       11001.       1100.
 8 ny        2020-04-05       11471.       1035.
 9 ny        2020-04-06       11845.       1002.
10 ny        2020-04-07       11902.       1313.
# ℹ 313 more rows
# ℹ Use `print(n = ...)` to see more rows

I'm planning to regenerate this just by requesting confirmed_incidence_num and then calculating the 7dav using epi_slide.

@brookslogan
Copy link
Contributor Author

@dshemetov When I last checked, simply re-downloading from API fixed it. In your test script, things might be messed up due to before = 7L rather than before = 6L + there might still be some annoyance from completion, the first 6 time values, etc. Really need to get around to #256.

@dshemetov
Copy link
Contributor

I already made a PR fix that calculates manually, but oof is a 7day window really specified with a before=6?

@brookslogan
Copy link
Contributor Author

brookslogan commented Jan 25, 2024

Yes, before=6. Not great, but the reason behind it was because we were trying to mirror interfaces betweeen epi_slide and epix_slide and n deceived people in both epi_slide and epix_slide. [This also matches slider's interface... I imagine they might have encountered a similar dilemma?] #256 decided approach is designed to go back to n and align but [adding completion so we're] not deceiving people in epi_slide [at least in grouped-by-all-geo&other-key-cols-case] and only being tricky in epix_slide case if you have negative-latency observations (i.e., predictions).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P0 high priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants