Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uses of BACK that cause correlated references: setup decorrelation handling #251

Merged
merged 148 commits into from
Feb 18, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Feb 6, 2025

Child PR of #269 that is part of addressing #141. Adds the handling of de-correlation as a post-processing step of the hybrid tree before relational conversion starts. Handles the singular & aggregate cases as described in the issue, and handles the semi-singular and semi-aggregate cases in the same manner as their non-semi equivalents (for now), saving the optimized variants for another day.

These remaining still need to be addressed in the next follow up:

  • TPC-H Q22: error during relational conversion that likely means a bug happened during hybrid conversion or hybrid decorrelation
  • Correl 15: error during decorrelation causing over-excessive column pruning
  • Correl 18: tbd bug in decorrelation
  • Correl 19: bug in name resolution handling that causes the wrong term to be used when certain property names collide

knassre-bodo and others added 30 commits January 13, 2025 11:59
Revision 2

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 3

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 4

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 5

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 6

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 7

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 8

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 9

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 10

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 11

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 12

Co-authored-by: Hadia Ahmed <[email protected]>
Revision 13

Co-authored-by: Hadia Ahmed <[email protected]>
@knassre-bodo knassre-bodo marked this pull request as ready for review February 10, 2025 20:18
Comment on lines +187 to +188
if levels == 0:
return self
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a useful trick to have rather than having to case on this behavior elsewhere, so x.shift_back(0) returns x.

@@ -895,6 +897,7 @@ def __init__(
self._is_connection_root: bool = is_connection_root
self._agg_keys: list[HybridExpr] | None = None
self._join_keys: list[tuple[HybridExpr, HybridExpr]] | None = None
self._correlated_children: set[int] = set()
Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set of indices indicates which of the elements in _children contain a correlated reference specifically to self. This is important because only those children should have the de-correlation procedure run on them, and only if they are of a certain type.

Comment on lines -1603 to -1606
case HybridRefExpr():
parent_result = HybridBackRefExpr(
parent_result.expr.name, 1, parent_result.typ
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This weird extra-shifting trick is no longer needed, and was actually incorrect.

Comment on lines +1631 to +1632
if not isinstance(parent_result, HybridCorrelExpr):
parent_tree.correlated_children.add(len(parent_tree.children))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we know which children of the hybrid tree are correlated. If parent_result is a correl expr, it means that the correlated reference is actually to the tree containing parent_tree.

@@ -1,8 +1,10 @@
ROOT(columns=[('name', name), ('n_prefix_nations', n_prefix_nations)], orderings=[(ordering_1):asc_first])
PROJECT(columns={'n_prefix_nations': n_prefix_nations, 'name': name, 'ordering_1': name})
PROJECT(columns={'n_prefix_nations': DEFAULT_TO(agg_0, 0:int64), 'name': name})
JOIN(conditions=[t0.key == t1.region_key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name}, correl_name='corr1')
JOIN(conditions=[t0.key == t1.key], types=['left'], columns={'agg_0': t1.agg_0, 'name': t0.name})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "simple" example of how the de-correaltion works.

  • The left subtree (region) became a part of the right subtree (nation) before the correlated reference (the filter) occurred. This results in selecting region twice, but that's an efficiency problem for another day.
  • The join is now on the uniquness keys of the left subtree: the region's .key property.

Copy link
Collaborator

@amullerbodo amullerbodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general thoughts about hashes and equality.

):
# These patterns do not require decorrelation since they
# are supported via correlated SEMI/ANTI joins.
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a default "case _:" here?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because these are exhaustive with regards to the enum. If any new case is generated, mypy will throw an exception because these branches are no longer exhaustive. So here, we can rely on mypy tooling to ensure quality.

return self.equals(other)

def __hash__(self) -> int:
return hash(repr(self))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any other way we could generate this hash? If someone changes repr inadvertently somewhere else it could break things. Not sure it is something we should be worried about at this stage.

Copy link
Contributor Author

@knassre-bodo knassre-bodo Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a better way off the top of my head since the best way to hash would be in a manner that depends on the name of the function/operator, which is always inside of its repr string (but each class handles that naming differently).

Base automatically changed from kian/correlated_backref_2 to kian/correlation_omnibus February 17, 2025 20:28
Copy link
Contributor

@vineetg3 vineetg3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few Typos, but LGTM!

@knassre-bodo knassre-bodo merged commit b304c29 into kian/correlation_omnibus Feb 18, 2025
3 checks passed
@knassre-bodo knassre-bodo deleted the kian/correlated_backref_3 branch February 18, 2025 20:39
knassre-bodo added a commit that referenced this pull request Feb 19, 2025
Resolves #141. See issue for more details. Child PRs deal with the components of the issue, ancillary changes, and additional bugfixes:
- #232
- #234 
- #251 
- #254 
- #272
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants