-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solve universe target issues #371
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Questions:
- What do you mean by "Note for the Acceptance: BLS (US) and Canada (NHS and census) had some median and average aggregated measurements without universe target"? I guess that "had" means that it's fixed in this PR?
- I don't think we should deploy a dump with universes until we have the extensions changes ready.
tasks/util.py
Outdated
"WHERE t.tablename = '{table}' " | ||
"AND c.aggregate IN ('average', 'median') " | ||
"AND (reltype IS NULL OR reltype <> 'universe')".format( | ||
table=self.output()._tablename)).fetchall() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this query also returns an error if an average/median column has a denominator. Does this case make sense?
I think it does (dividing an average of a class over the average of the total population, e.g: average income of engineers / average income).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, there isn't any case like this in the ETL but it makes sense, having a median/average with both a universe and a denominator shouldn't trigger the check.
tasks/util.py
Outdated
@@ -1665,6 +1669,23 @@ def check_null_columns(self): | |||
raise ValueError('The following columns of the table "{table}" contain only NULL values: {columns}'.format( | |||
table=self.output().table, columns=', '.join([x[0] for x in result]))) | |||
|
|||
def check_universe_in_aggregations(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see test changes, why don't we try to add a test for this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test forcing the ValueError
to raise when there is a median
or an average
aggregation without universe
denominator
|
tasks/carto.py
Outdated
AND NOT EXISTS ( | ||
SELECT 1 FROM agg_wo_universe u | ||
WHERE u.id = numer_c.id | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also tested the possible loss of performance of adding this check and it's insignificant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see. Not sure if we want to keep this in this query as it is redundant with the tasks check. My concern is not performance but readability, my head almost exploded reading this. If you prefer to keep it, let's add some comments, at least.
tasks/carto.py
Outdated
AND NOT EXISTS ( | ||
SELECT 1 FROM agg_wo_universe u | ||
WHERE u.id = numer_c.id | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see. Not sure if we want to keep this in this query as it is redundant with the tasks check. My concern is not performance but readability, my head almost exploded reading this. If you prefer to keep it, let's add some comments, at least.
tasks/carto.py
Outdated
@@ -490,6 +500,10 @@ class OBSMeta(Task): | |||
AND numer_ctag.column_id = numer_c.id | |||
AND numer_ctag.tag_id = numer_tag.id | |||
AND numer_c.id = leftjoined_denoms.all_numer_id | |||
AND NOT EXISTS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to add a check to remove the non-denominated versions of the universe numerators:
AND NOT (numer_agg IN ('median', 'average') AND denom_id IS NULL)
or something along those lines.
Edit: There are no NULL denom_id entries in the obs_meta
table (except if a column has no denominators at all). This idea doesn't work.
tasks/util.py
Outdated
"WHERE t.tablename = '{table}' " | ||
"AND c.aggregate IN ('average', 'median') " | ||
"GROUP BY 1, 2 " | ||
"HAVING LOWER(STRING_AGG(COALESCE(cc.reltype,''), ',')) NOT LIKE '%universe%'".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could make this prettier with ARRAY_AGG
and ANY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, let's go test it!
Like a charm, merging. Small issues:
Also, I run into the slow meta generation again, fixed it with an analyze. See #311 |
universe
targetuniverse
target formedian
andaverage
aggregated measurementsmedian
andaverage
aggregated measurements withoutuniverse
targetFixes #326
Note for the Acceptance: BLS (US), Australia and Canada (NHS and census) had some
median
andaverage
aggregated measurements withoutuniverse
target