Solve universe target issues #371

Merged
javitonino merged 19 commits into master from 326_Bring_back_the_universe
Nov 17, 2017

@antoniocarlon (Contributor) commented Nov 9, 2017

  • Restored universe target
  • Restored universe target for median and average aggregated measurements
  • Added a check to avoid having median and average aggregated measurements without universe target
  • Fixing other minor issues

Fixes #326

Note for the Acceptance: BLS (US), Australia and Canada (NHS and census) had some median and average aggregated measurements without universe target

@CartoDB CartoDB deleted a comment from houndci-bot Nov 10, 2017
@javitonino left a comment

LGTM.

Questions:

  1. What do you mean by "Note for the Acceptance: BLS (US) and Canada (NHS and census) had some median and average aggregated measurements without universe target"? I guess that "had" means that it's fixed in this PR?
  2. I don't think we should deploy a dump with universes until we have the extensions changes ready.

tasks/util.py Outdated
"WHERE t.tablename = '{table}' "
"AND c.aggregate IN ('average', 'median') "
"AND (reltype IS NULL OR reltype <> 'universe')".format(
table=self.output()._tablename)).fetchall()

I think this query also returns an error if an average/median column has a denominator. Does this case make sense?

I think it does (dividing an average of a class over the average of the total population, e.g: average income of engineers / average income).

@antoniocarlon (Contributor Author) commented Nov 13, 2017

There is currently no case like this in the ETL, but it makes sense: a median/average with both a universe and a denominator shouldn't trigger the check.
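
The difference between the two filtering strategies discussed here can be sketched in plain Python. This is only an illustration, not the ETL's code: the `(column_id, aggregate, reltype)` tuples, the sample column names, and both function names are hypothetical stand-ins for the rows the metadata query would return. A row-by-row filter flags a median that has both a universe and a denominator, while grouping by column first does not.

```python
from collections import defaultdict

# Hypothetical rows: (column_id, aggregate, reltype of one target, or None).
ROWS = [
    ("median_income", "median", "universe"),
    ("median_income", "median", "denominator"),
    ("average_rent", "average", "denominator"),
    ("median_age", "median", None),
    ("total_pop", "sum", None),
]


def flagged_row_by_row(rows):
    """Mimic `reltype IS NULL OR reltype <> 'universe'` applied to each row."""
    return sorted({col for col, agg, rel in rows
                   if agg in ("average", "median") and rel != "universe"})


def flagged_per_column(rows):
    """Flag a column only when none of its targets is a universe."""
    reltypes = defaultdict(set)
    for col, agg, rel in rows:
        if agg in ("average", "median"):
            reltypes[col].add(rel)
    return sorted(col for col, rels in reltypes.items()
                  if "universe" not in rels)
```

With the sample rows, `flagged_row_by_row(ROWS)` includes `median_income` because of its denominator row, whereas `flagged_per_column(ROWS)` reports only the columns that have no universe target at all.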

tasks/util.py Outdated
@@ -1665,6 +1669,23 @@ def check_null_columns(self):
raise ValueError('The following columns of the table "{table}" contain only NULL values: {columns}'.format(
table=self.output().table, columns=', '.join([x[0] for x in result])))

def check_universe_in_aggregations(self):

Now that I see test changes, why don't we try to add a test for this check?

@antoniocarlon (Contributor Author)

Added a test that forces the ValueError to be raised when a median or average aggregation has no universe target.
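
A test of that shape might look roughly like the sketch below. It does not reproduce the test added in this PR: `assert_universe_targets` and its dictionary input are hypothetical stand-ins, used here so the example runs without a database or the task machinery.

```python
import unittest


def assert_universe_targets(columns):
    """Hypothetical stand-in for the check.

    `columns` maps a column id to (aggregate, set of target reltypes).
    """
    missing = sorted(
        col for col, (agg, reltypes) in columns.items()
        if agg in ("average", "median") and "universe" not in reltypes)
    if missing:
        raise ValueError(
            "Columns without a universe target: {}".format(", ".join(missing)))


class TestUniverseCheck(unittest.TestCase):
    def test_median_without_universe_raises(self):
        columns = {"median_income": ("median", {"denominator"})}
        with self.assertRaises(ValueError):
            assert_universe_targets(columns)

    def test_median_with_universe_and_denominator_passes(self):
        columns = {"median_income": ("median", {"universe", "denominator"})}
        # Must not raise: a denominator next to a universe is allowed.
        assert_universe_targets(columns)
```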

@antoniocarlon (Contributor Author) commented

  1. This PR should fix all the median and average aggregated measurements that lacked a universe target. The note for the acceptance is meant as a hint about what to test.
  2. There's no need for changes in the extension as I have updated the OBS_Meta generation (also, see the comments here)

tasks/carto.py Outdated
AND NOT EXISTS (
SELECT 1 FROM agg_wo_universe u
WHERE u.id = numer_c.id
)

@antoniocarlon (Contributor Author)

I have also tested the possible performance cost of adding this check, and it is insignificant.

Nice to see. I'm not sure we want to keep this in the query, as it is redundant with the tasks check. My concern is not performance but readability: my head almost exploded reading this. If you prefer to keep it, let's at least add some comments.

tasks/carto.py Outdated
@@ -490,6 +500,10 @@ class OBSMeta(Task):
AND numer_ctag.column_id = numer_c.id
AND numer_ctag.tag_id = numer_tag.id
AND numer_c.id = leftjoined_denoms.all_numer_id
AND NOT EXISTS (

We may want to add a check to remove the non-denominated versions of the universe numerators:

AND NOT (numer_agg IN ('median', 'average') AND denom_id IS NULL) or something along those lines.

Edit: There are no NULL denom_id entries in the obs_meta table (except if a column has no denominators at all). This idea doesn't work.

tasks/util.py Outdated
"WHERE t.tablename = '{table}' "
"AND c.aggregate IN ('average', 'median') "
"GROUP BY 1, 2 "
"HAVING LOWER(STRING_AGG(COALESCE(cc.reltype,''), ',')) NOT LIKE '%universe%'".format(

You could make this prettier with ARRAY_AGG and ANY
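
One possible reading of that suggestion is sketched below, assuming PostgreSQL syntax. Only the `HAVING` clause is shown, since the rest of the query is not visible in the diff; the `with_having` helper is hypothetical. The `COALESCE` is kept on purpose: with `NULL` elements in the array, `= ANY` yields `NULL` rather than false when nothing matches, and the row would then be silently dropped. Note also that this variant matches `universe` exactly instead of as a substring.

```python
# HAVING clause from the diff: substring match on a concatenated string.
HAVING_STRING_AGG = (
    "HAVING LOWER(STRING_AGG(COALESCE(cc.reltype, ''), ',')) "
    "NOT LIKE '%universe%'"
)

# Sketch of an ARRAY_AGG/ANY variant (assumed PostgreSQL syntax).
HAVING_ARRAY_AGG = (
    "HAVING NOT ('universe' = ANY("
    "ARRAY_AGG(LOWER(COALESCE(cc.reltype, '')))))"
)


def with_having(base_query, having=HAVING_ARRAY_AGG):
    """Append a HAVING clause to a query that already ends in GROUP BY."""
    return "{base} {having}".format(base=base_query.rstrip(), having=having)
```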

@javitonino left a comment

Cool, let's go test it!

@javitonino commented Nov 17, 2017

Like a charm, merging.

Small issues:

  • au.data is very slow at generating columns because of the yields. Returning a list comprehension seems to be faster
  • I manually reduced the version for some columns in order to reimport, since this was missing some obscure dumps (eg: ACS quantiles need to be bumped in order to be able to easily run the normal columns).
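
For the first point, the two styles being compared look roughly like this. The function names and the column payload are hypothetical, and the sketch only shows that both forms produce the same columns; it does not measure or demonstrate the speed difference reported above.

```python
from collections import OrderedDict


def columns_with_yield(names):
    """Generator style: one (name, column) pair is yielded per iteration."""
    for name in names:
        yield name, {"id": name}


def columns_with_list(names):
    """Return style: the whole list is built by a comprehension."""
    return [(name, {"id": name}) for name in names]


def build_columns(pairs):
    """Collect (name, column) pairs into an ordered mapping."""
    return OrderedDict(pairs)
```

Either producer can be passed to `build_columns`, for example `build_columns(columns_with_list(["a", "b"]))`.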

Also, I ran into the slow meta generation again and fixed it with an ANALYZE. See #311

@javitonino javitonino merged commit 7ab1c37 into master Nov 17, 2017
@javitonino javitonino deleted the 326_Bring_back_the_universe branch November 17, 2017 13:08