Skip to content

Update archiver and export utils for nancodes and deletion-handling #1252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 28, 2021

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Sep 15, 2021

Description

Update the archiver utility and the export utility for nancode columns. This PR is meant to separate utility changes from the individual indicator PRs.

Changelog

The archiver utility:

  • Assumes files diffed have a particular column set. Expand this column set. If a file with missing columns is compared with a file without them, then all the rows are considered new.
  • If a row is deleted, then replace it with nans for the values fields and add Nans.DELETED missing codes.
  • If a file is deleted, then instead of removing the file, write a diff file for the same file with all the row entries set to missing. This should ensure that deletions are encoded in the database. If a value is returned, this should seamlessly bring them back. Removed since this would mark many existing indicator signal values as deleted.

The export utility:

  • Adds support for new missing columns.
  • Adds default values for missing columns.
  • Adds logger actions when default values are missing.
  • Adds a logger field to create_export_csv so logging.
  • Adds a flag to create_export_csv that will write an empty csv file for every missing day between start and end date (useful for some indicators, such as changehc, that did this without using the export util).

Fixes

dshemetov and others added 8 commits May 7, 2021 15:43
* update export utility to export, validate, and test the missing cols
* handle deleted rows: replaced with nan values
* handle deleted files: replace with an empty CSV file
* handle comparisons between CSVs with/without missing cols
@dshemetov dshemetov requested a review from jingjtang September 15, 2021 22:18
@dshemetov dshemetov changed the title Update archiver and export utils for nancodes Update archiver and export utils for nancodes and deletion-handling Sep 15, 2021
Copy link
Contributor

@jingjtang jingjtang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 21, 2021

FYI it is expected that the first run of acquisition uploads will be quite large after this gets merged (the .csv will contain 3 new columns, so all will be new).

@dshemetov
Copy link
Contributor Author

Also, want to make sure that my archiver changes sound reasonable @krivard, specifically replacing deleted rows with nan rows and replacing deleted files with nan rows.

@krivard
Copy link
Contributor

krivard commented Sep 22, 2021

replacing deleted rows with nan rows

Correct

replacing deleted files with nan rows

Extremely incorrect; most of our indicators only generate files for the most recent few days/weeks and not the full timeline. If not-producing those days gets interpreted as deletion, we'll eventually delete the whole timeline by accident.

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 22, 2021

Hm, I see what you're saying. I have a feeling of inconsistency about this though.

Currently the archiver treats the absence of a file exactly the same as if an exact copy of the file in the cache was produced. In most cases, this won't be an issue, since this will result in no acquisition update for that day and that's generally what we want it to do in that case. But in the case of an actual whole-day deletion, our archiver would miss this.

I think I'm coming around to seeing that this is probably ok, since whole-day deletion is unlikely. Just wanted to put my finger on why I sensed an inconsistency.

I can take out the deleted file nan-replacement part and we can trust that deletions will occur in select geos, on select days, not in all geos, on select days.

Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(fixing comments)

also is there a unit test here that ensures that indicators that don't yet publish the missing_* columns won't crash export/archive?

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 27, 2021

(fixing comments)

also is there a unit test here that ensures that indicators that don't yet publish the missing_* columns won't crash export/archive?

@krivard In test_archive.py, csv4 and csv5 test archiver comparisons between a file without missing columns and a file with them, in both directions (updated one doesn't have missing, updated one does have missing). The tests check that both files show up in the diffs section (if a file has missing columns and the other doesn't, all rows are treated as new).

In test_export.py, we have DFs with and without missing columns. Export won't throw errors if some columns are missing, it just filters down to the ones we expect to see. So it will export old-style CSVs normally.

Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@krivard krivard merged commit b5825a8 into cmu-delphi:main Sep 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants