Add support for uses of BACK that cause correlated references #269

knassre-bodo · 2025-02-17T20:19:28Z

Resolves #141. See issue for more details. Child PRs deal with the components of the issue, ancillary changes, and additional bugfixes:

Support uses of BACK that cause correlated references: Hybrid & Relational support for Correlate nodes #232
Support uses of BACK that cause correlated references: SQLGlot conversion #234
Support uses of BACK that cause correlated references: setup decorrelation handling #251
Support uses of BACK that cause correlated references: fix remaining decorrelation edge cases #254
Fix bug with accessing multiple chained PARTITION children #272

…ional support for Correlate nodes (#232) Child of #269 that is part of addressing #141. This PR deals with the hybrid & relational conversion, including the creation of new types of hybrid/relational nodes to express correlation, and also defines all of the correlation unit tests. This means that all correlated queries should produce a Relational plan, but will not be possible to convert to SQLGlot until the next PR. Subsequent PRs will deal with cases where a the relational plan needs to be de-correlated.

…sion (#234) Child PR of #269 that is part of addressing #141. Adds the SQLGlot conversion step. With these changes, all SQLGlot correlation queries are functional except for: - TPC-H queries: 5 & 22 - Correl queries: 1, 2, 3, 6, 8, 9, 15, 17, 18, 19, 20 These remainders need to be handled via the decorrelation cases in the comments of #141. Specifically: - Singular: Correl 8 - Aggregation: TPC-H Q5, TPC-H Q22, Correl 1, Correl 2, Correl 3, Correl 18, Correl 19 - Semi-Singular: Correl 9, Correl 17, Correl 20 - Semi-Aggregation: Correl 6

…ation handling (#251) Child PR of #269 that is part of addressing #141. Adds the handling of de-correlation as a post-processing step of the hybrid tree before relational conversion starts. Handles the singular & aggregate cases as described in the issue, and handles the semi-singular and semi-aggregate cases in the same manner as their non-semi equivalents (for now), saving the optimized variants for another day. These remaining still need to be addressed in the next follow up: - TPC-H Q22: error during relational conversion that likely means a bug happened during hybrid conversion or hybrid decorrelation - Correl 15: error during decorrelation causing over-excessive column pruning - Correl 18: tbd bug in decorrelation - Correl 19: bug in name resolution handling that causes the wrong term to be used when certain property names collide

…decorrelation edge cases (#254) Child PR of #269 that is part of addressing #141. Handles certain edge cases, such as: - When the de-correlation causes an aggregate to prune all of its outputs, creating an empty SELECT clause. - Fixing a bug in how decorrelation of a PARTITION child results in a cartesian join instead of joining on the partition keys. - Fixing a bug in renaming of hybrid children causing the wrong name overload to be chosen. - Fixing nondeterministic join key ordering during decorrelation. - Adjusting how BACK shifts for join/agg keys are handled so the levels takes into account levels that were skipped when snapshotting the parent. - Adjusting the SQLGlot conversion for aggregations so if there is no aggfunc, it instead does `SELECT DISTINCT` - Added 3 more correlation tests for very particular edge cases involving `PARTITION`, and also made some adjustments to the qualification/hybrid conversion of `PARTITION` to account for cases where a `PARTITION` node is the root of a child operator child access.

**Goal**: fix the bug causing PyDough such as the following to fail: ```py data = Tickers(symbol).TOP_K(5, by=symbol.ASC()) grps_a = PARTITION(data, name="child_3", by=(currency, exchange, ticker_type)) grps_b = PARTITION(grps_a, name="child_2", by=(currency, exchange)) grps_c = PARTITION(grps_b, name="child_1", by=exchange) result = grps_c.child_1.child_2.child_3 ``` The error is an `IndexError: list index out of range` in the following section of hybrid conversion: ```py case PartitionChild(): hybrid = self.make_hybrid_tree(node.ancestor_context, parent) successor_hybrid = HybridTree( HybridPartitionChild(hybrid.children[0].subtree) ) ... ``` Reason: if `node.ancestor_context` is **also** a partition child, the hybrid child subtree handling/invocation is not being done correctly. The correct way is to recursively step into `hybrid.pipeline[0].subtree` until `hybrid.pipeline[0]` is no longer a partition child. Once this happens, then we can access `children[0].subtree`, because then we have found original data being partitioned.

Pulling in minor child change

051100b

knassre-bodo added 2 commits February 17, 2025 15:26

knassre-bodo mentioned this pull request Feb 17, 2025

Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Open

knassre-bodo and others added 4 commits February 18, 2025 15:39

[RUN CI]

269fe28

knassre-bodo merged commit 947ca93 into main Feb 19, 2025
5 checks passed

knassre-bodo deleted the kian/correlation_omnibus branch February 19, 2025 03:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for uses of BACK that cause correlated references #269

Add support for uses of BACK that cause correlated references #269

knassre-bodo commented Feb 17, 2025 •

edited

Loading

Add support for uses of BACK that cause correlated references #269

Add support for uses of BACK that cause correlated references #269

Conversation

knassre-bodo commented Feb 17, 2025 • edited Loading

knassre-bodo commented Feb 17, 2025 •

edited

Loading