Update Google Docs Meta Data #1612
Merged
newest comparison code:

import csv
import requests

# pull down the existing and proposed/pending versions of the signal description csv file
dev_file = "https://raw.githubusercontent.com/cmu-delphi/delphi-epidata/refs/heads/dev/src/server/endpoints/covidcast_utils/db_signals.csv"
dev = []
with requests.get(dev_file, stream=True) as req:
    for row in csv.reader(req.iter_lines(decode_unicode=True)):
        dev.append(row)
new_file = "https://raw.githubusercontent.com/cmu-delphi/delphi-epidata/refs/heads/bot/update-docs/src/server/endpoints/covidcast_utils/db_signals.csv"
new = []
with requests.get(new_file, stream=True) as req:
    for row in csv.reader(req.iter_lines(decode_unicode=True)):
        new.append(row)

# column name lists
dev_cols = set(dev[0])
new_cols = set(new[0])
both_cols = list(dev_cols.intersection(new_cols))

# get the right column number for each version of the file, based on the column name
dev_col_lookup = {c: i for i,c in enumerate(dev[0])}
new_col_lookup = {c: i for i,c in enumerate(new[0])}

# get the right row number for each version of the file, based on `(source,signal)`
dev_row_lookup = {}
for i, row in enumerate(dev):
    src = row[dev_col_lookup["Source Subdivision"]]
    sig = row[dev_col_lookup["Signal"]]
    if (src, sig) in dev_row_lookup:
        print("!!! src:sig duplicate in dev file! --", src, ":", sig)
    dev_row_lookup[(src, sig)] = i
dev_signals = set(dev_row_lookup.keys())
new_row_lookup = {}
for i, row in enumerate(new):
    src = row[new_col_lookup["Source Subdivision"]]
    sig = row[new_col_lookup["Signal"]]
    if (src, sig) in new_row_lookup:
        print("!!! src:sig duplicate in new file! --", src, ":", sig)
    new_row_lookup[(src, sig)] = i
new_signals = set(new_row_lookup.keys())

# print summary info
if dev[0] != new[0]:
    print("column ordering changed!")
print("added columns:", sorted(list(new_cols-dev_cols)))
print("removed columns:", sorted(list(dev_cols-new_cols)))
print("# rows in dev file:", len(dev))
print("# rows in new file:", len(new))
print("row count difference:", len(new)-len(dev))
print("added signals:", sorted(list(new_signals-dev_signals)))
print("removed signals:", sorted(list(dev_signals-new_signals)))
print("\n")
# TODO: detect row reorderings

# add column names to this set as needed to ignore differences found in them (to simplify output for easier analysis)
columns_to_ignore = {"XXXXXX ignore me XXXXXX"}
both_cols = [col for col in both_cols if col not in columns_to_ignore]

# show individual changes
changes_count = 0
for i in range(len(dev)):
    src = dev[i][dev_col_lookup["Source Subdivision"]]
    sig = dev[i][dev_col_lookup["Signal"]]
    if (src, sig) not in new_row_lookup:
        # this is a removed signal so no summary is displayed
        continue
    dev_ln_num = i
    new_ln_num = new_row_lookup[(src, sig)]
    # prepare properly ordered list of values from both
    dev_line = [dev[dev_ln_num][dev_col_lookup[col]] for col in both_cols]
    new_line = [new[new_ln_num][new_col_lookup[col]] for col in both_cols]
    if dev_line != new_line:
        changes_count += 1
        print("\nMISMATCH!! [", src, ":", sig, "] dev row:", dev_ln_num+1, "/ new row:", new_ln_num+1)
        print("\n".join(["".join([
            " ", col, ":\n ", dev[dev_ln_num][dev_col_lookup[col]], "\n -->\n ", new[new_ln_num][new_col_lookup[col]]])
            for col in both_cols if dev[dev_ln_num][dev_col_lookup[col]]!=new[new_ln_num][new_col_lookup[col]]
        ]))
        print("\n")
print("lines with changes:", changes_count)
# TODO: use f-string formatting in print() statements
output from code above:
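The comparison script leaves row-reordering detection as a TODO. One possible way to fill it in is sketched below, not necessarily what the author had in mind. It takes the `(source, signal)` keys of each file in row order, drops keys that exist in only one file so that additions and removals do not count as moves, and reports every shared key whose rank among the shared keys differs. The helper name `reordered_keys` is hypothetical, and the sketch assumes keys are unique within each file (the script already warns when they are not).

```python
def reordered_keys(old_keys, new_keys):
    """Return (key, old_rank, new_rank) for shared keys whose relative order changed.

    Ranks are positions among the keys present in BOTH lists, so rows that
    were only added or only removed never cause a reported move.
    Assumes keys are unique within each list.
    """
    shared = set(old_keys) & set(new_keys)
    old_common = [k for k in old_keys if k in shared]
    new_common = [k for k in new_keys if k in shared]
    new_rank = {k: r for r, k in enumerate(new_common)}
    return [(k, r, new_rank[k]) for r, k in enumerate(old_common) if new_rank[k] != r]


# Small illustrative inputs (made-up keys, not taken from db_signals.csv).
old = [("a", "x"), ("a", "y"), ("b", "z"), ("c", "w")]
new = [("a", "y"), ("a", "x"), ("n", "s"), ("c", "w")]
for key, old_rank, new_rank in reordered_keys(old, new):
    print(f"moved: {key} rank {old_rank} -> {new_rank}")
```

With the script's variables, the key lists could be built from the data rows, for example `[(row[dev_col_lookup["Source Subdivision"]], row[dev_col_lookup["Signal"]]) for row in dev[1:]]` and likewise for `new`. Note that a single moved row shifts the rank of every shared row between its old and new position, so this reports all affected rows rather than a minimal set of moves.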
carlynvandyke approved these changes on Feb 28, 2025
Looks good, sorry it took so long to review!
Updating Google Docs Meta Data
Change summary:
- db_sources.csv:
  - nssp
- db_signals.csv:
  - google-symptoms
    - signals for conjunctivitis
  - nssp
    - signals for rsv and for counts of reporting hospitals
  - nchs-mortality
    - names and signal sets