Repair initial issue for Quidel:Omicron edition #1520
Comments
Looking into this, I'm not sure where the error is, so I'll write out my understanding of the archiver's logic for deleted old-style files (i.e. no missingness columns, #866):
But as I see in one of your commands, it seems that archivediffer is repeatedly producing NAN rows as if they were new, e.g.
So it seems that I need to check the logic on (3). With respect to deletion annotations: those would be nice to have and would add an additional interpretation layer to the missingness, but I think something else is going on here that's causing archivediffer to treat a new row of NAs as different from an identical row of NAs.
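For illustration only: one generic way a differ can end up treating an all-NA row as different from an identical all-NA row is that NaN never compares equal to NaN under IEEE 754. Whether that is the mechanism at work here is not established in this thread. The row layout and function names below are hypothetical, not archivediffer's.

```python
import math


def is_nan(x):
    """True only for float NaN; strings and None are never NaN."""
    return isinstance(x, float) and math.isnan(x)


def rows_equal_naive(a, b):
    """Element-wise ==, which is False wherever both sides are NaN."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def rows_equal_nan_aware(a, b):
    """Element-wise comparison that treats NaN as equal to NaN."""
    return len(a) == len(b) and all(
        (is_nan(x) and is_nan(y)) or x == y for x, y in zip(a, b)
    )


# Hypothetical rows: (geo_id, val, se, sample_size)
cached_row = ("21229", float("nan"), float("nan"), float("nan"))
fresh_row = ("21229", float("nan"), float("nan"), float("nan"))

print(rows_equal_naive(cached_row, fresh_row))      # False
print(rows_equal_nan_aware(cached_row, fresh_row))  # True
```

With the naive comparison, a cached NA row and a freshly generated NA row would be reported as a change on every run; the NaN-aware version would not report it.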
I have an idea of what's going on. Assuming the same setup as above:
So that's a bit of logic I missed earlier. The solution would then be to add a step to archivediffer that checks if the |
Recall that CSV files are named for their reference date, and not their issue date. Archivediffer compares sequential issue dates, not sequential reference dates -- it looks at the data we published in
What's missing is the annotations -- I assumed that archivediffer would add annotations to an old-style file when inserting deletion records. Is that not the case?
Ah right. Well considering the dates in my previous comments as issue dates instead, the logic still holds. I wrote a fix PR, about to put it up (#1522).
I didn't do that here, because I wanted to avoid spreading NAN-setting code across many places. Ideally it would all happen in the indicator, but I had to add minimal fail-safe logic in acquisition (most indicators' NA columns are currently set with that logic). We can reconsider that though. Mainly, we would have to think about how to set the default values for the missing columns (in the below, new-old means new-style file in archiver and old-style file in receiving):
EDIT: To summarize two hidden options in the above, we can either: a) write sanity checking code for the NAN column in the archiver or b) punt and let acquisition do it. The pro with a) is that sanity checking would be localized, but the con is that it duplicates what's in acquisition. The pro and con with b) are reversed (con: sanity checking is not localized, pro: code not duplicated). Personally, I'm in favor of (b), and to ameliorate the non-local code, I could write an eye-catching comment in archivediffer referring to the acquisition validations. @krivard thoughts?
help me understand your nomenclature here --
this is likely related to the confused comment I left in your PR though so probably best to resolve it there first and then come back to this once we're on the same page
The table is correct. The new-old case would happen after the first time an old-old indicator processes data and the archiver fills in the missing columns - the cache would contain a new-style file, but the next day the indicator would produce another old-style file.
Ok, I think I jumped to a conclusion on the fix in #1522. I'm now 80% certain the functionality there is actually unneeded - we only need the deletions to show up in a diff, they don't need to be added to the new archived file (since if they get undeleted later, they will just be added lines). So the main issue then is just the missing annotations, which I could add to archivediffer in that PR (contingent on some discussion there).
Also, with the clarifications in the other post, I now think that the new-old case won't happen - if missing annotations are added in the diffs only, then the comparisons will always be between old-old files. However, a potential issue here is that the archiver will continually add missing columns, since they are never present in the cache, and there is nothing to compare against to know whether the columns have already been added. I think this suggests that we probably should cache the annotations the archiver makes.
I don't know about "continually" -- in theory this only happens for files that have deletions. The only missingness columns that should be added by archivediffer are "deleted" (for deleted rows) and "not missing" (for undeleted rows in a file with deleted rows). afaik nan values are still illegal for old-style files.
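A minimal sketch of the rule described in the comment above, under stated assumptions: files are modeled as dicts keyed by `geo_id`, and the `Missing` enum, its numeric values, and `diff_with_deletions` are hypothetical stand-ins rather than the project's actual code. Annotations appear only when the diff contains at least one deletion.

```python
import math
from enum import IntEnum


class Missing(IntEnum):
    """Illustrative codes only; not taken from the project's source."""
    NOT_MISSING = 0
    DELETED = 4


NAN = float("nan")


def diff_with_deletions(old, new):
    """Diff two old-style files given as {geo_id: (val, se, sample_size)}.

    Returns {geo_id: (row, annotation)}. Assumes old-style rows never
    contain NaN, so plain tuple comparison is safe here.
    """
    changed = {g: r for g, r in new.items() if old.get(g) != r}
    deleted = old.keys() - new.keys()
    if not deleted:
        # No deletions: the diff stays old-style, with no annotation at all.
        return {g: (r, None) for g, r in changed.items()}
    out = {g: (r, Missing.NOT_MISSING) for g, r in changed.items()}
    for g in deleted:
        out[g] = ((NAN, NAN, NAN), Missing.DELETED)
    return out


old = {"21229": (0.5, 0.1, 20.0), "21001": (0.2, 0.05, 40.0)}
new = {"21001": (0.3, 0.05, 40.0)}
diff = diff_with_deletions(old, new)
```

Here `21229` comes out as an all-NaN row marked deleted, and `21001` as a changed row marked not missing.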
So iirc the "se" and "sample_size" columns could have nan values and "value" couldn't. But this goes back to the summary in this comment - we could just let acquisition handle sanity checking of the annotations, since it already has that logic, leave a comment in archivediffer about it, and keep the archivediffer annotation logic simple. So it seems like a possible logic for old-old-style file comparison is:
if this is something acquisition already does, great & let's do it (though if you could hunt down the test case that checks this it would appease my brain weasels 🐹 ) if this is something we'd need to add to acquisition though, i'd kinda rather handle it here even if it means some code duplication. there have been enough times in the past that we've resurrected issues by digging files out of the acquisition successful archive that i'd rather avoid the future confusion of examining them and seeing an obvious annotations conflict.
Can emoji be links? Let's find out 🐹💆 (validation code; ignore the strongly principled stance in the function docstring which claims we throw errors instead of making inferences; I wrote that in my younger and more idealistic days 🍎 and never updated it). Let me know if that looks satisfying, otherwise I can add similar fail-safes here.
Oh! This is even better: Will this work: for your (3) above, instead of archivediffer outputting "not missing" for nondeleted rows, output NA. Then, acquisition will fill in OTHER or NOT_MISSING as it sees fit -- ?
Yup, that should work!
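As a rough sketch of that split of responsibilities (not the actual acquisition validation code): the differ writes an explicit code only for deleted rows and leaves the annotation empty otherwise, and a fill-in step downstream infers the rest from the value. The `Missing` enum, its numeric values, and `fill_missing_code` are hypothetical names used for illustration.

```python
import math
from enum import IntEnum


class Missing(IntEnum):
    """Illustrative codes only; not taken from the project's source."""
    NOT_MISSING = 0
    DELETED = 4
    OTHER = 5


def fill_missing_code(value, code):
    """Keep an explicit annotation; otherwise infer one from the value.

    value: float or None; code: int-like annotation or None (the NA case).
    """
    if code is not None:
        return Missing(code)
    if value is None or math.isnan(value):
        # Value is absent and nobody said why.
        return Missing.OTHER
    return Missing.NOT_MISSING


print(fill_missing_code(0.3, None).name)           # NOT_MISSING
print(fill_missing_code(float("nan"), None).name)  # OTHER
print(fill_missing_code(float("nan"), 4).name)     # DELETED
```

This keeps the inference in one place: the differ only has to know about deletions.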
Excellent -- make it so! do you need anything else from me for now?
I don't think so, thanks!
Let's do |
Whoops, code is done, but there's a bunch of data stuff left to do
Initial issue repair complete.
Summary
The omicron changes for Quidel in v0.3.2 came with a whole heap of regions which should no longer be reported. In order to prioritize correct display in the dashboard, we used the following procedure:
Archivediffer seems to have noticed the deletions but not marked them in the output.
We need to:
Details
Here’s a sample: Washington County, KY (FIPS:21229)
There are 612 days for which the pre-omicron code published a value for Washington County. The following environment was configured using pre-omicron code (v0.3.1) to output 1000 days of data, which carries us through the beginning of the calendar for Quidel:
The archivediffer cache in production shows no days of data for Washington County:
The indicator logs for yesterday’s first v0.3.2 run included many lines like this:
However, while the files in /common/covidcast/archive/successful have 611 entries for Washington County (one short is weird but w/e), they contain no nan annotation columns that would have marked these nan values as deleted:
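For anyone repeating this check on other files, a header test along the following lines is enough to tell old-style from new-style CSVs. This is a sketch: the column names `missing_val`, `missing_se`, and `missing_sample_size` are assumed here rather than quoted from the files, and the sample contents are made up for illustration.

```python
import csv
import io

# Assumed names for the missingness columns; adjust to the real schema.
MISSING_COLS = ("missing_val", "missing_se", "missing_sample_size")


def has_missing_columns(csv_file):
    """Return True if the CSV header contains every missingness column."""
    header = next(csv.reader(csv_file), [])
    return all(col in header for col in MISSING_COLS)


# Made-up file contents, not real data.
OLD_STYLE = "geo_id,val,se,sample_size\n21229,NA,NA,NA\n"
NEW_STYLE = (
    "geo_id,val,se,sample_size,missing_val,missing_se,missing_sample_size\n"
    "21229,NA,NA,NA,4,4,4\n"
)

print(has_missing_columns(io.StringIO(OLD_STYLE)))  # False
print(has_missing_columns(io.StringIO(NEW_STYLE)))  # True
```

The function accepts any open text file, so it can be pointed at files opened with `open(path, newline="")` as well.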
I've put file archives of everything above online: