Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uses of BACK that cause correlated references: fix remaining decorrelation edge cases #254

Merged
merged 166 commits into from
Feb 18, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Feb 10, 2025

Child PR of #269 that is part of addressing #141. Handles certain edge cases, such as:

  • When the de-correlation causes an aggregate to prune all of its outputs, creating an empty SELECT clause.
  • Fixing a bug in how decorrelation of a PARTITION child results in a cartesian join instead of joining on the partition keys.
  • Fixing a bug in renaming of hybrid children causing the wrong name overload to be chosen.
  • Fixing nondeterministic join key ordering during decorrelation.
  • Adjusting how BACK shifts for join/agg keys are handled so the levels takes into account levels that were skipped when snapshotting the parent.
  • Adjusting the SQLGlot conversion for aggregations so if there is no aggfunc, it instead does SELECT DISTINCT
  • Added 3 more correlation tests for very particular edge cases involving PARTITION, and also made some adjustments to the qualification/hybrid conversion of PARTITION to account for cases where a PARTITION node is the root of a child operator child access.

knassre-bodo and others added 30 commits January 13, 2025 11:59
Revision 2

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 3

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 4

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 5

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 6

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 7

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 8

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 9

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 10

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 11

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 12

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 13

Co-authored-by: Hadia Ahmed <[email protected]>
Comment on lines +504 to +516
while (
used_name in terms
or used_name in renamings
or used_name in new_renamings
):
used_name = f"{name}_{idx}"
idx += 1
terms[used_name] = expr
renamings[name] = used_name
new_renamings[name] = used_name
renamings.update(new_renamings)
for old_name, new_name in new_renamings.items():
expr = new_expressions.pop(old_name)
new_expressions[new_name] = expr
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change allows us to ensure that the terms in new_expressions also get renamed when necessary.

Copy link
Collaborator

@amullerbodo amullerbodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checked the changes looks good from my end maybe I would find time to add some more asserts around the code base in general. Great work!

@@ -62,9 +64,10 @@ def make_decorrelate_parent(
# case, all of the parent's children & pipeline operators should be
# included in the snapshot.
assert hybrid.parent is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any other asserts we should write here regarding hybrid.parent ?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really since the if statement above should cover it. In general, I only uses assertions when they are needed for mypy to not ocmplain about the typing (e.g. w/o this assertion, the next line would fail because hybrid.parent.children isn't legal if the type of hybrid.parent could be None).

Base automatically changed from kian/correlated_backref_3 to kian/correlation_omnibus February 18, 2025 20:39
Comment on lines +477 to +480
if aggregations:
query = query.group_by(*keys)
else:
query = query.distinct()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so if we have an aggregate on keys A, B, C w/o any aggregation functions, we can just do SELECT DISTINCT A, B, C FROM (...)

JOIN(conditions=[t0.key_9 == t1.key_21 & t0.order_key == t1.order_key & t0.line_number == t1.line_number & t0.key_5 == t1.key_17 & t0.key_2 == t1.key_14 & t0.key == t1.key], types=['left'], columns={'account_balance': t0.account_balance, 'domestic': t1.domestic})
JOIN(conditions=[t0.key_9 == t1.key_21 & t0.line_number == t1.line_number & t0.order_key == t1.order_key & t0.key_5 == t1.key_17 & t0.key_2 == t1.key_14 & t0.key == t1.key], types=['left'], columns={'account_balance': t0.account_balance, 'domestic': t1.domestic})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a change in the ordering, caused by sorting to ensure determinism of the plans.

Copy link
Contributor

@vineetg3 vineetg3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

JOIN(conditions=[True:bool], types=['inner'], columns={'customer_key': t0.customer_key, 'order_date': t0.order_date, 'order_date_2': t1.order_date, 'total_price': t0.total_price, 'total_price_3': t1.total_price})
JOIN(conditions=[t0.customer_key == t1.customer_key & t0.order_date == t1.order_date], types=['inner'], columns={'customer_key': t0.customer_key, 'order_date': t0.order_date, 'order_date_2': t1.order_date, 'total_price': t0.total_price, 'total_price_3': t1.total_price})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a huge bug previously and is fixed by the main block of changes in relational_converter.py. This was the original code:

cust_date_groups = PARTITION(
        Orders.WHERE(YEAR(order_date) == 1993),
        name="o",
        by=(customer_key, order_date),
)
selected_groups = cust_date_groups.WHERE(COUNT(o) > 1)(
        total_price=SUM(o.total_price),
)(n_above_avg=COUNT(o.WHERE(total_price >= 0.5 * BACK(1).total_price))) TPCH(n=SUM(selected_groups.n_above_avg))

The problem arises in joining cust_date_groups back to Orders.WHERE(YEAR(order_date) == 1993) (in order to derive COUNT(o.WHERE(...)). In this case, we need to re-derive Orders.WHERE(YEAR(order_date) == 1993) but this time it gets joined onto the aggregated data, therefore Orders.WHERE(YEAR(order_date) == 1993) has join keys to link it to the ancestor (described by preceding_hybrid[0].pipeline[0].unique_exprs).

@knassre-bodo knassre-bodo merged commit 7e3de91 into kian/correlation_omnibus Feb 18, 2025
3 checks passed
@knassre-bodo knassre-bodo deleted the kian/correlated_backref_4 branch February 18, 2025 22:44
knassre-bodo added a commit that referenced this pull request Feb 19, 2025
Resolves #141. See issue for more details. Child PRs deal with the components of the issue, ancillary changes, and additional bugfixes:
- #232
- #234 
- #251 
- #254 
- #272
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants