Support uses of BACK that cause correlated references: fix remaining decorrelation edge cases #254
Co-authored-by: Hadia Ahmed <[email protected]>
Revision 2 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 3 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 4 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 5 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 6 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 7 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 8 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 9 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 10 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 11 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 12 Co-authored-by: Hadia Ahmed <[email protected]>
Revision 13 Co-authored-by: Hadia Ahmed <[email protected]>
while (
    used_name in terms
    or used_name in renamings
    or used_name in new_renamings
):
    used_name = f"{name}_{idx}"
    idx += 1
terms[used_name] = expr
renamings[name] = used_name
new_renamings[name] = used_name
renamings.update(new_renamings)
for old_name, new_name in new_renamings.items():
    expr = new_expressions.pop(old_name)
    new_expressions[new_name] = expr
This change ensures that the terms in new_expressions also get renamed when necessary.
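As a rough illustration of the renaming step described above, here is a minimal, self-contained sketch. The function name `add_unique_terms` and its signature are hypothetical and not taken from the PR; only the collision loop and the final re-keying of `new_expressions` mirror the diff.

```python
def add_unique_terms(terms, renamings, new_expressions):
    """Hypothetical helper: merge new_expressions into terms, renaming on collision.

    Returns the mapping of old names to the fresh names that were chosen.
    """
    new_renamings = {}
    for name, expr in list(new_expressions.items()):
        used_name = name
        idx = 0
        # Keep generating candidates until the name is unused everywhere.
        while (
            used_name in terms
            or used_name in renamings
            or used_name in new_renamings
        ):
            used_name = f"{name}_{idx}"
            idx += 1
        terms[used_name] = expr
        if used_name != name:
            new_renamings[name] = used_name
    renamings.update(new_renamings)
    # Re-key new_expressions so it agrees with the names stored in terms.
    for old_name, new_name in new_renamings.items():
        new_expressions[new_name] = new_expressions.pop(old_name)
    return new_renamings
```

Without the final loop, `new_expressions` would still refer to a renamed term by its old name, which is the mismatch the comment above describes.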
Checked the changes; they look good from my end. Maybe I will find time to add some more asserts around the codebase in general. Great work!
@@ -62,9 +64,10 @@ def make_decorrelate_parent(
# case, all of the parent's children & pipeline operators should be
# included in the snapshot.
assert hybrid.parent is not None
Are there any other asserts we should write here regarding hybrid.parent?
Not really, since the if statement above should cover it. In general, I only use assertions when they are needed for mypy to not complain about the typing (e.g. without this assertion, the next line would fail because hybrid.parent.children isn't legal if the type of hybrid.parent could be None).
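To make the typing point concrete, here is a small sketch of an assertion narrowing an optional attribute. The `Node` class and `sibling_names` function are hypothetical stand-ins, not the PR's actual hybrid tree classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a hybrid tree node."""

    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def sibling_names(node: Node) -> List[str]:
    # Without this assert, mypy reports an error on `node.parent.children`
    # because `node.parent` has type Optional[Node]. The assert narrows the
    # type to Node for the rest of the function.
    assert node.parent is not None
    return [c.name for c in node.parent.children if c is not node]
```

The assert adds no new runtime guarantee when an earlier check already rules out `None`; its purpose here is to tell the type checker what the surrounding logic has already established.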
… changed GROUP BY without aggfuncs to DISTINCT
if aggregations:
    query = query.group_by(*keys)
else:
    query = query.distinct()
This is so that if we have an aggregate on keys A, B, C without any aggregation functions, we can just do SELECT DISTINCT A, B, C FROM (...).
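The equivalence relied on here can be checked with a short sketch using the standard-library `sqlite3` module. The `build_query` helper and the table `t` are hypothetical and only illustrate the choice between the two forms; they are not the PR's query builder.

```python
import sqlite3


def build_query(keys, aggregations, source="t"):
    """Hypothetical helper: emit DISTINCT when there are no aggregation functions."""
    cols = ", ".join(keys)
    if aggregations:
        aggs = ", ".join(f"{expr} AS {name}" for name, expr in aggregations.items())
        return f"SELECT {cols}, {aggs} FROM {source} GROUP BY {cols}"
    return f"SELECT DISTINCT {cols} FROM {source}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (2, 2, 3)],
)

# With no aggregation functions, DISTINCT and GROUP BY return the same rows.
distinct_rows = sorted(conn.execute(build_query(["a", "b", "c"], {})).fetchall())
grouped_rows = sorted(
    conn.execute("SELECT a, b, c FROM t GROUP BY a, b, c").fetchall()
)
```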
JOIN(conditions=[t0.key_9 == t1.key_21 & t0.order_key == t1.order_key & t0.line_number == t1.line_number & t0.key_5 == t1.key_17 & t0.key_2 == t1.key_14 & t0.key == t1.key], types=['left'], columns={'account_balance': t0.account_balance, 'domestic': t1.domestic})
JOIN(conditions=[t0.key_9 == t1.key_21 & t0.line_number == t1.line_number & t0.order_key == t1.order_key & t0.key_5 == t1.key_17 & t0.key_2 == t1.key_14 & t0.key == t1.key], types=['left'], columns={'account_balance': t0.account_balance, 'domestic': t1.domestic})
This is just a change in the ordering, caused by sorting to ensure determinism of the plans.
LGTM!
JOIN(conditions=[True:bool], types=['inner'], columns={'customer_key': t0.customer_key, 'order_date': t0.order_date, 'order_date_2': t1.order_date, 'total_price': t0.total_price, 'total_price_3': t1.total_price})
JOIN(conditions=[t0.customer_key == t1.customer_key & t0.order_date == t1.order_date], types=['inner'], columns={'customer_key': t0.customer_key, 'order_date': t0.order_date, 'order_date_2': t1.order_date, 'total_price': t0.total_price, 'total_price_3': t1.total_price})
This was a huge bug previously and is fixed by the main block of changes in relational_converter.py. This was the original code:
cust_date_groups = PARTITION(
    Orders.WHERE(YEAR(order_date) == 1993),
    name="o",
    by=(customer_key, order_date),
)
selected_groups = cust_date_groups.WHERE(COUNT(o) > 1)(
    total_price=SUM(o.total_price),
)(n_above_avg=COUNT(o.WHERE(total_price >= 0.5 * BACK(1).total_price)))
TPCH(n=SUM(selected_groups.n_above_avg))
The problem arises in joining cust_date_groups back to Orders.WHERE(YEAR(order_date) == 1993) in order to derive COUNT(o.WHERE(...)). In this case, we need to re-derive Orders.WHERE(YEAR(order_date) == 1993), but this time it gets joined onto the aggregated data. Therefore, Orders.WHERE(YEAR(order_date) == 1993) needs join keys to link it to the ancestor (described by preceding_hybrid[0].pipeline[0].unique_exprs). Previously the join condition was simply True, which paired every group with every order.
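The effect of the missing join keys can be reproduced with a rough SQL analogue run through the standard-library `sqlite3` module. This is only a sketch under stated assumptions: the `orders` table, its sample rows, and the `count_above_half` helper are made up for illustration and are not output from the PR's converter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_key INTEGER, order_date TEXT, total_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "1993-01-01", 10.0),
        (1, "1993-01-01", 30.0),
        (2, "1993-02-01", 5.0),
        (2, "1993-02-01", 5.0),
        (3, "1993-03-01", 100.0),
    ],
)

# o: the filtered orders; g: groups of (customer_key, order_date) with >1 order.
TEMPLATE = """
WITH o AS (
    SELECT * FROM orders WHERE substr(order_date, 1, 4) = '1993'
),
g AS (
    SELECT customer_key, order_date, SUM(total_price) AS total_price
    FROM o
    GROUP BY customer_key, order_date
    HAVING COUNT(*) > 1
)
SELECT COUNT(*)
FROM g JOIN o ON {condition}
WHERE o.total_price >= 0.5 * g.total_price
"""


def count_above_half(condition):
    """Hypothetical helper: run the query with the given join condition."""
    return conn.execute(TEMPLATE.format(condition=condition)).fetchone()[0]


# Joining on the partition keys pairs each group only with its own orders.
keyed = count_above_half(
    "g.customer_key = o.customer_key AND g.order_date = o.order_date"
)
# A condition that is always true pairs every group with every order.
unkeyed = count_above_half("1 = 1")
```

With this sample data the keyed join counts 3 orders while the always-true condition counts 7, because each group's threshold is compared against orders from other groups as well.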
Child PR of #269 that is part of addressing #141. Handles certain edge cases, such as:
SELECT DISTINCT
PARTITION, and also made some adjustments to the qualification/hybrid conversion of PARTITION to account for cases where a PARTITION node is the root of a child operator child access.