Update archiver and export utils for nancodes and deletion-handling #1252
Conversation
* update export utility to export, validate, and test the missing cols
* handle deleted rows: replaced with nan values
* handle deleted files: replace with an empty CSV file
* handle comparisons between CSVs with/without missing cols
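The deleted-row handling and the with/without-missing-cols comparison described above can be sketched roughly as follows. This is only an illustration using the standard library, not the archiver's actual code: the column names (`geo_id`, `val`, `se`, `sample_size`, `missing_val`, `missing_se`, `missing_sample_size`), the `DELETED` code value, and the helper names are all assumptions made for the example.

```python
import csv
import io

# Hypothetical column names; the PR only states that three "missing" columns are added.
VALUE_COLS = ["val", "se", "sample_size"]
MISSING_COLS = ["missing_val", "missing_se", "missing_sample_size"]
DELETED = 2  # placeholder nancode meaning "deleted" (an assumption for this sketch)


def read_rows(text):
    """Parse CSV text into {geo_id: row}, defaulting absent missing_* columns to "0"."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        for col in MISSING_COLS:
            row.setdefault(col, "0")  # older files may lack these columns entirely
        rows[row["geo_id"]] = row
    return rows


def diff_rows(old_text, new_text):
    """Return changed or added rows, plus a nan row for every row that disappeared."""
    old, new = read_rows(old_text), read_rows(new_text)
    diff = {geo: row for geo, row in new.items() if old.get(geo) != row}
    for geo in old.keys() - new.keys():
        deleted = {"geo_id": geo}
        deleted.update({col: "nan" for col in VALUE_COLS})
        deleted.update({col: str(DELETED) for col in MISSING_COLS})
        diff[geo] = deleted
    return diff


old_csv = "geo_id,val,se,sample_size\nak,1.0,0.1,10\nal,2.0,0.2,20\n"
new_csv = (
    "geo_id,val,se,sample_size,missing_val,missing_se,missing_sample_size\n"
    "ak,1.0,0.1,10,0,0,0\n"
)
result = diff_rows(old_csv, new_csv)
```

Here the cached file has no missing columns and the new file does; `ak` compares as unchanged once the defaults are filled in, while `al`, which disappeared, comes back as a nan row carrying the deletion code.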
lgtm
FYI it is expected that the first run of acquisition uploads will be quite large after this gets merged (the .csv will contain 3 new columns, so all will be new).
Also, want to make sure that my archiver changes sound reasonable @krivard, specifically replacing deleted rows with nan rows and replacing deleted files with nan rows.
Correct
Extremely incorrect; most of our indicators only generate files for the most recent few days/weeks and not the full timeline. If not-producing those days gets interpreted as deletion, we'll eventually delete the whole timeline by accident.
Hm, I see what you're saying. I have a feeling of inconsistency about this though. Currently the archiver treats the absence of a file exactly the same as if an exact copy of the file in the cache was produced. In most cases, this won't be an issue, since this will result in no acquisition update for that day and that's generally what we want it to do in that case. But in the case of an actual whole-day deletion, our archiver would miss this. I think I'm coming around to seeing that this is probably ok, since whole-day deletion is unlikely. Just wanted to put my finger on why I sensed an inconsistency. I can take out the deleted file nan-replacement part and we can trust that deletions will occur in select geos, on select days, not in all geos, on select days.
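For reference, the file-level behavior being discussed (an absent file is treated like an unchanged one, so it triggers no acquisition update) could be sketched like this. The function name, the dictionary-based cache, and the filenames are hypothetical and only stand in for whatever the archiver really does:

```python
def plan_updates(cached, exported):
    """Given {filename: csv_text} maps, list the files that need an acquisition update."""
    to_upload = []
    for name, text in exported.items():
        # New files and files whose contents changed are uploaded.
        if cached.get(name) != text:
            to_upload.append(name)
    # Files present in `cached` but absent from `exported` are skipped on purpose:
    # most indicators only regenerate recent days, so absence is not a deletion.
    return sorted(to_upload)


cached_files = {
    "20210101_state_sig.csv": "geo_id,val\nak,1\n",
    "20210102_state_sig.csv": "geo_id,val\nak,2\n",
}
exported_files = {
    "20210102_state_sig.csv": "geo_id,val\nak,3\n",  # changed
    "20210103_state_sig.csv": "geo_id,val\nak,4\n",  # new
}
uploads = plan_updates(cached_files, exported_files)
```

The 20210101 file is not regenerated, so it is neither uploaded nor marked as deleted. The trade-off raised above is that a genuine whole-day deletion would go unnoticed under this rule.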
(fixing comments)
also is there a unit test here that ensures that indicators that don't yet publish the missing_*
columns won't crash export/archive?
Co-authored-by: Katie Mazaitis <[email protected]>
@krivard In In |
👍
Description
Update the archiver utility and the export utility for nancode columns. This PR is meant to separate utility changes from the individual indicator PRs.
Changelog
The archiver utility:
If a file is deleted, then instead of removing the file, write a diff file for the same file with all the row entries set to missing. This should ensure that deletions are encoded in the database. If a value is returned, this should seamlessly bring them back.
Removed since this would mark many existing indicator signal values as deleted.
The export utility:
changehc, that did this without using the export util).
Fixes