-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for uses of BACK that cause correlated references #269
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was referenced Feb 17, 2025
…ional support for Correlate nodes (#232) Child of #269 that is part of addressing #141. This PR deals with the hybrid & relational conversion, including the creation of new types of hybrid/relational nodes to express correlation, and also defines all of the correlation unit tests. This means that all correlated queries should produce a Relational plan, but will not be possible to convert to SQLGlot until the next PR. Subsequent PRs will deal with cases where a the relational plan needs to be de-correlated.
…sion (#234) Child PR of #269 that is part of addressing #141. Adds the SQLGlot conversion step. With these changes, all SQLGlot correlation queries are functional except for: - TPC-H queries: 5 & 22 - Correl queries: 1, 2, 3, 6, 8, 9, 15, 17, 18, 19, 20 These remainders need to be handled via the decorrelation cases in the comments of #141. Specifically: - Singular: Correl 8 - Aggregation: TPC-H Q5, TPC-H Q22, Correl 1, Correl 2, Correl 3, Correl 18, Correl 19 - Semi-Singular: Correl 9, Correl 17, Correl 20 - Semi-Aggregation: Correl 6
…ation handling (#251) Child PR of #269 that is part of addressing #141. Adds the handling of de-correlation as a post-processing step of the hybrid tree before relational conversion starts. Handles the singular & aggregate cases as described in the issue, and handles the semi-singular and semi-aggregate cases in the same manner as their non-semi equivalents (for now), saving the optimized variants for another day. These remaining still need to be addressed in the next follow up: - TPC-H Q22: error during relational conversion that likely means a bug happened during hybrid conversion or hybrid decorrelation - Correl 15: error during decorrelation causing over-excessive column pruning - Correl 18: tbd bug in decorrelation - Correl 19: bug in name resolution handling that causes the wrong term to be used when certain property names collide
…decorrelation edge cases (#254) Child PR of #269 that is part of addressing #141. Handles certain edge cases, such as: - When the de-correlation causes an aggregate to prune all of its outputs, creating an empty SELECT clause. - Fixing a bug in how decorrelation of a PARTITION child results in a cartesian join instead of joining on the partition keys. - Fixing a bug in renaming of hybrid children causing the wrong name overload to be chosen. - Fixing nondeterministic join key ordering during decorrelation. - Adjusting how BACK shifts for join/agg keys are handled so the levels takes into account levels that were skipped when snapshotting the parent. - Adjusting the SQLGlot conversion for aggregations so if there is no aggfunc, it instead does `SELECT DISTINCT` - Added 3 more correlation tests for very particular edge cases involving `PARTITION`, and also made some adjustments to the qualification/hybrid conversion of `PARTITION` to account for cases where a `PARTITION` node is the root of a child operator child access.
**Goal**: fix the bug causing PyDough such as the following to fail: ```py data = Tickers(symbol).TOP_K(5, by=symbol.ASC()) grps_a = PARTITION(data, name="child_3", by=(currency, exchange, ticker_type)) grps_b = PARTITION(grps_a, name="child_2", by=(currency, exchange)) grps_c = PARTITION(grps_b, name="child_1", by=exchange) result = grps_c.child_1.child_2.child_3 ``` The error is an `IndexError: list index out of range` in the following section of hybrid conversion: ```py case PartitionChild(): hybrid = self.make_hybrid_tree(node.ancestor_context, parent) successor_hybrid = HybridTree( HybridPartitionChild(hybrid.children[0].subtree) ) ... ``` Reason: if `node.ancestor_context` is **also** a partition child, the hybrid child subtree handling/invocation is not being done correctly. The correct way is to recursively step into `hybrid.pipeline[0].subtree` until `hybrid.pipeline[0]` is no longer a partition child. Once this happens, then we can access `children[0].subtree`, because then we have found original data being partitioned.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #141. See issue for more details. Child PRs deal with the components of the issue, ancillary changes, and additional bugfixes: