Repair initial issue for Quidel:Omicron edition #1520


Closed
2 tasks done
krivard opened this issue Feb 9, 2022 · 20 comments · Fixed by #1522
Assignees
Labels
data quality Missing data, weird data, broken data

Comments

@krivard
Contributor

krivard commented Feb 9, 2022

Summary

The omicron changes for Quidel in v0.3.2 came with a whole heap of regions which should no longer be reported. In order to prioritize correct display in the dashboard, we used the following procedure:

  • Run Quidel v0.3.1 for the full calendar of reference dates. Initialize the archivediffer cache with this output.
  • Run Quidel v0.3.2 for the full calendar of reference dates. Given the cached files and the new output, archivediffer should have marked the removed regions as deleted in the files delivered to /common/covidcast/receiving.

Archivediffer seems to have noticed the deletions but not marked them in the output.

We need to:

  • figure out why the deletions weren't marked correctly
  • find a way to generate deletion annotations for everything removed by quidel between v0.3.1 and v0.3.2.

Details

Here’s a sample: Washington County, KY (FIPS:21229)

There are 612 days on which the pre-omicron code published a value for Washington County. The following environment was configured, using pre-omicron code (v0.3.1), to output 1000 days of data, which carries us through the beginning of the calendar for Quidel:

covidcast-indicators/quidel_covidtest$ git status -uno
HEAD detached at covidcast-indicators/v0.3.1
covidcast-indicators/quidel_covidtest$ grep ^21229 receiving/*_county_covid_ag_smoothed_pct_positive.csv |head
receiving/20200526_county_covid_ag_smoothed_pct_positive.csv:21229,4.913667717688675,3.0568697758622676,50.0
receiving/20200527_county_covid_ag_smoothed_pct_positive.csv:21229,4.6693051871295275,2.9837161653964275,50.0
receiving/20200528_county_covid_ag_smoothed_pct_positive.csv:21229,4.875077252904564,3.0454600541418,50.0
receiving/20200529_county_covid_ag_smoothed_pct_positive.csv:21229,5.258998129413856,3.1567158618292117,50.0
receiving/20200530_county_covid_ag_smoothed_pct_positive.csv:21229,5.0472663892846406,3.09597073928053,50.00000000000001
receiving/20200531_county_covid_ag_smoothed_pct_positive.csv:21229,6.006233899782876,3.3602039947580162,50.00000000000001
receiving/20200601_county_covid_ag_smoothed_pct_positive.csv:21229,5.751155128909549,3.292536188333159,50.0
receiving/20200602_county_covid_ag_smoothed_pct_positive.csv:21229,5.001733541585632,3.0827131418201743,50.0
receiving/20200603_county_covid_ag_smoothed_pct_positive.csv:21229,5.025413601051074,3.0896167343004803,50.0
receiving/20200604_county_covid_ag_smoothed_pct_positive.csv:21229,5.139949494826776,3.1227419639583016,50.0
covidcast-indicators/quidel_covidtest$ grep ^21229 receiving/*_county_covid_ag_smoothed_pct_positive.csv |wc -l
612

The archivediffer cache in production shows no days of data for Washington County:

[indicators@delphi-master-prod-01 quidel_covidtest]$ grep ^21229 archivediffer_cache/*_county_covid_ag_smoothed_pct_positive.csv |wc -l
0

The indicator logs for yesterday’s first v0.3.2 run included many lines like this:

Diff has deleted indices in ./receiving/20220114_county_covid_ag_smoothed_pct_positive.csv that have been coded as nans.

However, while the files in /common/covidcast/archive/successful have 611 entries for Washington County (one short is weird but w/e), they contain no nan annotation columns that would have marked these nan values as deleted:

[indicators@delphi-master-prod-01 quidel_covidtest]$ zgrep ^21229 /common/covidcast/archive/successful/quidel/*county_covid_ag_smoothed_pct_positive.csv.gz |head
/common/covidcast/archive/successful/quidel/20200526_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200527_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200528_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200529_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200530_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200531_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200601_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200602_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200603_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200604_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
[indicators@delphi-master-prod-01 quidel_covidtest]$ zgrep ^21229 /common/covidcast/archive/successful/quidel/*county_covid_ag_smoothed_pct_positive.csv.gz |wc -l
611
[indicators@delphi-master-prod-01 quidel_covidtest]$ zcat /common/covidcast/archive/successful/quidel/20200526_county_covid_ag_smoothed_pct_positive.csv.gz |head
geo_id,val,se,sample_size
01117,NA,NA,NA
12063,NA,NA,NA
12101,NA,NA,NA
12103,NA,NA,NA
13077,NA,NA,NA
13097,NA,NA,NA
13113,NA,NA,NA
13171,NA,NA,NA
13223,NA,NA,NA

I've put file archives of everything above online:

@krivard krivard added the data quality Missing data, weird data, broken data label Feb 9, 2022
@krivard krivard self-assigned this Feb 9, 2022
@dshemetov
Contributor

Looking into this, I'm not sure where the error is, so I'll write out my understanding of the archiver's logic for deletions in old-style files (i.e. files without missingness columns, #866):

  1. let's take Washington County, KY (FIPS:21229); supposing there's data on 2020-05-26 but no data for the county on 2020-05-27, archivediffer will mark that geo as deleted, but without an annotation, resulting in a new row:
  2. geo_id,val,se,sample_size
    21229,NA,NA,NA
  3. on 2020-05-28, if there is still no data, the same row above should be carried forward and should not trigger archivediffer again
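The per-day step described above can be sketched in pandas. This is a hypothetical illustration, not the actual archiver code; the variable names (`before`, `after`, `diff`) are mine:

```python
import pandas as pd

# Hypothetical sketch of the step above: a geo present in the cached
# (before) file but absent from the new (after) file is marked deleted
# by emitting an all-NA row for that geo_id.
before = pd.DataFrame(
    {"geo_id": ["21229"], "val": [4.91], "se": [3.06], "sample_size": [50.0]}
)
after = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size"])  # no data today

deleted_ids = before.loc[~before["geo_id"].isin(after["geo_id"]), "geo_id"]
deleted_rows = before.loc[before["geo_id"].isin(deleted_ids)].copy()
deleted_rows[["val", "se", "sample_size"]] = float("nan")

diff = pd.concat([after, deleted_rows], ignore_index=True)
print(diff.to_csv(index=False))
```

With no annotation columns, the only trace of the deletion in the output is the all-NA row for 21229.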

But as I see from one of your commands, it seems that archivediffer is repeatedly producing NA rows as if they were new, e.g.

/common/covidcast/archive/successful/quidel/20200527_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200528_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA
/common/covidcast/archive/successful/quidel/20200529_county_covid_ag_smoothed_pct_positive.csv.gz:21229,NA,NA,NA

So it seems that I need to check the logic on (3).

Wrt deletion annotations -- those would be nice to have and would add an additional interpretation layer to the missingness, but I think something else is going on here that's causing archivediffer to treat a new row of NAs as different from an identical existing row of NAs.

@dshemetov
Contributor

I have an idea of what's going on. Assuming the same setup as above:

  1. on 2020-05-27 the archive differ wrote nans for all the rows, since they had no data
  2. on 2020-05-28 the archive differ compares the file full of nan rows to a totally empty file (no data) and finds that all the indices have been deleted and marks them as deleted again, without checking if this has already happened.

So that's a bit of logic I missed earlier. The solution would then be to add a step to archivediffer that checks if the deleted_df rows are already present in the before file.
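That check could look roughly like this (a hypothetical sketch under the setup above; `deleted_df` and the column names follow the discussion, not the actual archiver source):

```python
import pandas as pd

# Hypothetical fix: before reporting deletions, drop any geo_ids whose
# cached (before) row is already all-NA, so each deletion is emitted
# only once rather than on every subsequent run.
before = pd.DataFrame({
    "geo_id": ["21229", "01117"],
    "val": [None, 1.0],          # 21229 was already deleted on a prior run
    "se": [None, 0.1],
    "sample_size": [None, 10.0],
})
after = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size"])  # empty today

missing = before.loc[~before["geo_id"].isin(after["geo_id"])]
already_deleted = missing[["val", "se", "sample_size"]].isna().all(axis=1)
deleted_df = missing.loc[~already_deleted]

print(deleted_df["geo_id"].tolist())  # only 01117 is newly deleted
```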

@krivard
Contributor Author

krivard commented Feb 9, 2022

But as I see in one of your commands, it seems that archivediffer is repeatedly producing NAN rows as if they are new e.g.

Recall that CSV files are named for their reference date, and not their issue date. Archivediffer compares sequential issue dates, not sequential reference dates -- it looks at the data we published in 20200527_county_covid_ag_smoothed_pct_positive.csv.gz yesterday and compares that to the data we generated for 20200527_county_covid_ag_smoothed_pct_positive.csv.gz today. This is correct behavior.

What's missing is the annotations -- I assumed that archivediffer would add annotations to an old-style file when inserting deletion records. Is that not the case?

@dshemetov
Contributor

dshemetov commented Feb 10, 2022

Ah right. Well considering the dates in my previous comments as issue dates instead, the logic still holds. I wrote a fix PR, about to put it up (#1522).

What's missing is the annotations -- I assumed that archivediffer would add annotations to an old-style file when inserting deletion records. Is that not the case?

I didn't do that here, because I wanted to avoid spreading NAN setting code in many places. Ideally it would all happen in the indicator, but I had to add minimal fail-safe logic in acquisition (most indicators' NA columns are currently set with that logic).

We can reconsider that though. Mainly, we would have to think about how to set the default values for the missing columns (in the below, new-old means new-style file in archiver and old-style file in receiving):

  • in the case of old-old, it could be sensible to use just NOT_MISSING (when a value is present), OTHER (when encountering a nan), or DELETED (when missing)
  • in the case of new-old, we could simply copy over the missing columns from the new-style file; here's some case analysis
    • "same": new-style 21229,1,0.1,10,0,0,0 --> old-style 21229,1,0.1,10 would produce the new line 21229,1,0.1,10,0,0,0, which is valid
    • "changed values": new-style 21229,1,0.1,10,0,0,0 --> old-style 21229,1,0.1,20 would produce the new line 21229,1,0.1,20,0,0,0, which is valid
    • "no more NAs": new-style 21229,1,0.1,NA,0,0,1 --> old-style 21229,1,0.1,10 would produce the new line 21229,1,0.1,10,0,0,1, which is invalid, but the incorrect 1 code would be caught and corrected in acquisition to 0 (NOT_MISSING)
    • "new NAs": new-style 21229,1,0.1,10,0,0,0 --> old-style 21229,1,0.1,NA would produce the new line 21229,1,0.1,NA,0,0,0, which is invalid, but the incorrect 0 code would be caught and corrected in acquisition to 5 (OTHER)
    • "deleted": new-style 21229,1,0.1,10,0,0,0 --> old-style is missing the 21229 geo_value, which would produce the new line 21229,NA,NA,NA,4,4,4, which is valid
    • "added": new-style is missing 21229 --> old-style 21229,1,0.1,NA would produce the new line 21229,1,0.1,NA,0,0,0, which is invalid, but the incorrect 0 code would be caught and corrected in acquisition to 5 (OTHER)
  • a similar logic can be applied to old-new comparisons
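The "copy over the missing columns" idea for the new-old case can be sketched as follows. This is hypothetical code, not the archiver's implementation; the codes (0 = NOT_MISSING, 4 = DELETED, 5 = OTHER) are taken from the cases above, and the frame names are mine:

```python
import pandas as pd

NOT_MISSING, DELETED, OTHER = 0, 4, 5  # codes as used in the case analysis

# Cached new-style file (with missing_* columns)...
cached = pd.DataFrame({
    "geo_id": ["21229"], "val": [1.0], "se": [0.1], "sample_size": [10.0],
    "missing_val": [NOT_MISSING], "missing_se": [NOT_MISSING],
    "missing_sample_size": [NOT_MISSING],
})
# ...and today's old-style file with a changed sample_size
# (the "changed values" case above).
today = pd.DataFrame(
    {"geo_id": ["21229"], "val": [1.0], "se": [0.1], "sample_size": [20.0]}
)

# Carry the missing_* columns over from the cached new-style file.
merged = today.merge(
    cached[["geo_id", "missing_val", "missing_se", "missing_sample_size"]],
    on="geo_id", how="left",
)
print(merged.iloc[0].tolist())  # ['21229', 1.0, 0.1, 20.0, 0, 0, 0]
```

The invalid cases in the analysis above are the ones where this naive copy produces a code that no longer matches the value, which is what acquisition would then correct.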

EDIT: To summarize two hidden options in the above, we can either (a) write sanity-checking code for the NAN columns in the archiver, or (b) punt and let acquisition do it. The pro of (a) is that the sanity checking is localized, but the con is that it duplicates what's in acquisition. The pros and cons of (b) are reversed (con: sanity checking is not localized; pro: no duplicated code). Personally, I'm in favor of (b); to ameliorate the non-local code, we could write an eye-catching comment in archivediffer referring to the acquisition validations.

@krivard thoughts?

@krivard
Contributor Author

krivard commented Feb 10, 2022

help me understand your nomenclature here --

| label | cache file | export file | when it happens |
| --- | --- | --- | --- |
| old-old | no annotations | no annotations | Typical case for indicators that haven't merged their NANs PR yet |
| new-old | has annotations | no annotations | ?? |
| old-new | no annotations | has annotations | Only happens at first run after an indicator merges its NANs PR |

this is likely related to the confused comment I left in your PR, though, so it's probably best to resolve it there first and then come back to this once we're on the same page

@dshemetov
Contributor

The table is correct. The new-old case would happen after the first time an old-old indicator processes data and the archiver fills in the missing columns: the cache would then contain a new-style file, but the next day the indicator would produce another old-style file.

@dshemetov
Contributor

dshemetov commented Feb 11, 2022

Ok, I think I jumped to a conclusion on the fix in #1522. I'm now 80% certain the functionality there is actually unneeded -- we only need the deletions to show up in a diff; they don't need to be added to the new archived file (since if they get un-deleted later, they will just appear as added lines).

So the main issue then is just the missing annotations, which I could add to the archive differ in that PR (contingent on some discussion there).

@dshemetov
Contributor

dshemetov commented Feb 14, 2022

Also, with the clarifications in the other post, I now think the new-old case won't happen: if missing annotations are added only in the diffs, then the comparisons will always be between old-old files.

However, a potential issue here is that the archiver will continually re-add the missing columns, since they are never present in the cache, and there is nothing to compare against to know whether the columns have already been added. I think this suggests that we should probably cache the annotations the archiver makes.

@krivard
Contributor Author

krivard commented Feb 16, 2022

the archiver will continually add missing columns

I don't know about "continually" -- in theory this only happens for files that have deletions. The only missingness columns archivediffer should add are "deleted" (for deleted rows) and "not missing" (for undeleted rows in a file with deleted rows). afaik NAN values are still illegal in old-style files.

@dshemetov
Contributor

dshemetov commented Feb 16, 2022

So iirc the "se" and "sample_size" columns can have NAN values, but "value" can't. This goes back to the summary in my earlier comment: we could just let acquisition handle sanity-checking the annotations, since it already has that logic, leave a comment in archivediffer pointing to it, and keep the archivediffer annotation logic simple.

So it seems like a possible logic for old-old-style file comparison is:

  1. diff a file,
  2. if there are no deletions, but there are other changes and additions, export the diff without annotations,
  3. if there are deletions, export the diff with "deleted" annotations and write "not missing" for the other changed rows and let acquisition fix the codes where this is wrong
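That three-step logic could look roughly like this (a hypothetical sketch, not the actual ArchiveDiffer code; `export_diff` and the column names are mine, and the codes 0 = NOT_MISSING, 4 = DELETED follow the thread):

```python
import pandas as pd

NOT_MISSING, DELETED = 0, 4  # codes as discussed earlier in the thread

def export_diff(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the old-old comparison logic above."""
    # Step 1: diff -- find geos present in the cached file but not the new one.
    deleted_ids = before.loc[~before["geo_id"].isin(after["geo_id"]), "geo_id"]
    # Step 2: no deletions -> export the diff without annotations.
    if deleted_ids.empty:
        return after
    # Step 3: deletions present -> add NA rows annotated "deleted" and
    # default the rest to "not missing"; acquisition fixes wrong codes.
    deleted = pd.DataFrame({"geo_id": deleted_ids}).assign(
        val=float("nan"), se=float("nan"), sample_size=float("nan")
    )
    diff = pd.concat([after, deleted], ignore_index=True)
    miss_cols = ["missing_val", "missing_se", "missing_sample_size"]
    for col in miss_cols:
        diff[col] = NOT_MISSING
    diff.loc[diff["geo_id"].isin(deleted_ids), miss_cols] = DELETED
    return diff

before = pd.DataFrame({"geo_id": ["21229", "01117"], "val": [1.0, 2.0],
                       "se": [0.1, 0.2], "sample_size": [10.0, 20.0]})
after = pd.DataFrame({"geo_id": ["01117"], "val": [2.5],
                      "se": [0.2], "sample_size": [20.0]})
out = export_diff(before, after)
print(out[["geo_id", "missing_val"]].to_string(index=False))
```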

@krivard
Contributor Author

krivard commented Feb 18, 2022

write "not missing" for the other changed rows and let acquisition fix the codes where this is wrong

if this is something acquisition already does, great & let's do it (though if you could hunt down the test case that checks this it would appease my brain weasels 🐹 )

if this is something we'd need to add to acquisition though, i'd kinda rather handle it here even if it means some code duplication. there have been enough times in the past that we've resurrected issues by digging files out of the acquisition successful archive that i'd rather avoid the future confusion of examining them and seeing an obvious annotations conflict.

@dshemetov
Contributor

dshemetov commented Feb 19, 2022

Can emoji be links? Let's find out 🐹💆 (validation code; ignore the strongly principled stance in the function docstring which claims we throw errors instead of making inferences; I wrote that in my younger and more idealistic days 🍎 and never updated it).

Let me know if that looks satisfying, otherwise I can add similar fail-safes here.

@krivard
Contributor Author

krivard commented Feb 21, 2022

Oh! This is even better:

https://github.com/cmu-delphi/delphi-epidata/blob/1849ce655d35619b297c84210cf664a17987d6d0/src/acquisition/covidcast/csv_importer.py#L241-L244

Will this work: for your (3) above, instead of archivediffer outputting "not missing" for nondeleted rows, it outputs NA; then acquisition fills in OTHER or NOT_MISSING as it sees fit -- ?

@dshemetov
Contributor

Yup, that should work!

@krivard
Contributor Author

krivard commented Feb 21, 2022

Excellent -- make it so! do you need anything else from me for now?

@dshemetov
Contributor

I don't think so, thanks!

@dshemetov
Contributor

dshemetov commented Feb 24, 2022

@krivard so, a small complication. We interpret the missing columns as a plain pandas int dtype, which doesn't allow NAs. We would need to switch to the nullable "Int64" dtype here. An alternative would be to default all nondeleted rows to Nans.OTHER instead of NA; the acquisition fixing would work the same way.
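For reference, the dtype behavior in question (standard pandas; the code values are illustrative):

```python
import pandas as pd

# NumPy-backed "int64" cannot represent NA, so putting NA into a column
# of missingness codes requires either the nullable "Int64" extension
# dtype or a non-NA default such as Nans.OTHER.
codes = pd.Series([0, 4, pd.NA], dtype="Int64")  # nullable dtype: works
print(codes.isna().tolist())

try:
    pd.Series([0, 4, pd.NA], dtype="int64")  # plain int64: cannot hold NA
except (TypeError, ValueError) as e:
    print("int64 rejects NA:", type(e).__name__)
```

The memory cost of "Int64" is the extra validity mask it carries alongside the integer values.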

@krivard
Contributor Author

krivard commented Feb 25, 2022

Let's do Int64 -- I'll double-check with Brian that doubling the memory requirements won't cause any problems, but I don't think archivediffer is enough of a memory hog for it to matter

@krivard
Contributor Author

krivard commented Mar 2, 2022

Whoops, code is done, but there's a bunch of data stuff left to do

@krivard krivard reopened this Mar 2, 2022
@krivard
Contributor Author

krivard commented Mar 17, 2022

Initial issue repair complete.

@krivard krivard closed this as completed Mar 17, 2022