
Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Open · wants to merge 25 commits into base: main
Conversation


@knassre-bodo knassre-bodo commented Feb 11, 2025

Resolves #255; see the issue for more details about the goals of this drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with the new semantics. The collection equivalents of back-references were deleted, as they are no longer needed and were never fully supported to begin with.
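As a rough illustration of the new semantics (the collection and field names here are hypothetical, not from this repository), the old explicit BACK syntax versus the new down-streamed aliases look roughly like this:

```
# Before: CALC via call syntax, with an explicit back-reference
result = People(n=COUNT(children)).children(parent_count=BACK(1).n)

# After: CALCULATE down-streams the alias n into child contexts,
# so the child can reference it directly as an implicit back-reference
result = People.CALCULATE(n=COUNT(children)).children.CALCULATE(parent_count=n)
```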

from collections.abc import Callable

import pytest
from test_utils import (
graph_fetcher,
)
from tpch_test_functions import (
@knassre-bodo knassre-bodo Feb 12, 2025

Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.

@knassre-bodo knassre-bodo changed the title Overhaul BACK and CALC to use downstreaming of aliases Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method Feb 13, 2025
Comment on lines -20 to -23
lps_back_lines_impl,
lps_back_lines_price_impl,
lps_back_supplier_impl,
lps_back_supplier_name_impl,

These examples are no longer valid.

@knassre-bodo knassre-bodo marked this pull request as ready for review February 13, 2025 08:25
result = People(
result = People.CALCULATE(
@knassre-bodo knassre-bodo Feb 13, 2025

We need to ensure all of these got updated, in all of our documentation, notebooks, tests, etc.

@@ -101,7 +101,7 @@
"source": [
@knassre-bodo knassre-bodo Feb 13, 2025

At least 1 reviewer should re-run all 5 notebooks to confirm they behave as expected.

" O_ORDERPRIORITY=order_priority,\n",
" ORDER_COUNT=COUNT(o),\n",
").ORDER_BY(O_ORDERPRIORITY.ASC())\n",
"pydough.to_df(output)"
]
},
{
"cell_type": "markdown",

Also added 5/21/22, since by the time this is merged, correlation support will have already been added.

@@ -875,21 +875,23 @@ def rel_translation(
@staticmethod
def preprocess_root(
node: PyDoughCollectionQDAG,
output_cols: list[tuple[str, str]] | None,

Uses output_cols as an override of the normal behavior.
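A minimal sketch of the override behavior described above (the function name and the exact shape of the tuples are assumptions for illustration, not the real `preprocess_root` signature): when `output_cols` is `None`, the default ordering is used; otherwise it replaces the output with explicit (alias, column) pairs.

```python
def select_output_columns(ordered_columns, output_cols=None):
    # Hypothetical sketch: output_cols, when provided, overrides the
    # normal column ordering/naming with explicit (alias, name) pairs;
    # None falls through to the default behavior.
    if output_cols is None:
        return ordered_columns
    available = dict(ordered_columns)
    return [(alias, available[name]) for alias, name in output_cols]

default = [("key", "t0.key"), ("name", "t0.name")]
print(select_output_columns(default))                  # default ordering kept
print(select_output_columns(default, [("k", "key")]))  # [('k', 't0.key')]
```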

rel_expr = output.expressions[hybrid_expr]
ordered_columns.append((original_name, rel_expr))
ordered_columns.sort(key=lambda col: node.get_expression_position(col[0]))
if columns is None:

Same here.

@@ -66,6 +66,47 @@ def _load_session_info(
return metadata, config, database, bindings


def _load_column_selection(kwargs: dict[str, object]) -> list[tuple[str, str]] | None:

Some of the testing changes were specifically to make sure this kwarg works as expected (most tests just pass in None).
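A hedged sketch of what such a kwarg-normalizing helper might do (the real `_load_column_selection` may accept different input shapes; the dict/list handling here is an assumption): normalize an optional `columns` kwarg into `(output_name, column_name)` pairs, with `None` meaning "no override".

```python
def load_column_selection(kwargs):
    # Hypothetical sketch of the real _load_column_selection helper:
    # pull an optional "columns" kwarg and normalize it to a list of
    # (output_name, column_name) pairs; None means "use default columns".
    columns = kwargs.pop("columns", None)
    if columns is None:
        return None
    if isinstance(columns, dict):
        # {alias: column_name} -> [(alias, column_name)]
        return list(columns.items())
    if isinstance(columns, (list, tuple)):
        # plain names keep their own name as the output alias
        return [(name, name) for name in columns]
    raise TypeError(f"unsupported column selection: {columns!r}")

print(load_column_selection({"columns": ["a", "b"]}))   # [('a', 'a'), ('b', 'b')]
print(load_column_selection({"columns": {"x": "col"}})) # [('x', 'col')]
print(load_column_selection({}))                        # None
```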

Comment on lines 426 to +430
for name in collection.calc_terms:
# Skip columns that are overloaded with a name from an ancestor,
# since they should not be used.
if name in collection.ancestral_mapping:
continue

See `case Reference():` further down for where this comes into play. Skipping them now prevents unnecessarily including an ancestral term until it is specifically requested (otherwise, we would get tons of unnecessary correlations).
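The filtering in the diff above boils down to the following sketch (the term names are hypothetical examples): a calc term whose name is shadowed by an ancestral (down-streamed) term is skipped, so the ancestral value is only materialized when explicitly referenced.

```python
# Hypothetical example data: "container" is down-streamed from 1 level up,
# so it shadows the local calc term of the same name and must be skipped.
calc_terms = ["part_key", "container", "retail_price"]
ancestral_mapping = {"container": 1}

output_terms = [name for name in calc_terms if name not in ancestral_mapping]
print(output_terms)  # ['part_key', 'retail_price']
```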

hybrid.add_successor(successor_hybrid)
self.populate_children(successor_hybrid, node, child_ref_mapping)
partition_child_idx: int = child_ref_mapping[0]
back_exprs: dict[str, HybridExpr] = {}

The trick here with PARTITION_BY is that all of the ancestral terms that would be lost by the change in ancestry need to be flushed downward into the current child context, so that all of them remain accessible later if needed. This does not cause correlations, since it only propagates down-streamed terms along the single vertical spine from the child node to the Partition.

E.g. consider the following snippet:

selected_lines = parts.CALCULATE(
  container
).supply_records.WHERE(
  supplier.nation.region.name == "EUROPE"
).lines
global_info = TPCH(global_avg_qty=AVG(selected_lines.quantity))
result = global_info.PARTITION(
  selected_lines, name="lines", by=container
).CALCULATE(
  container_avg_qty
).lines.WHERE(
  (quantity > global_avg_qty) & 
  (quantity > container_avg_qty)
)

The container term, made available for down-streaming, is never used by selected_lines. However, when it is partitioned, we first flush it downward into the terms of selected_lines so it can be used as an implicit back-reference to the original ancestry. Then, when the ancestry is changed, lines gains access to global_avg_qty and container_avg_qty from its new ancestry. However, we do not want global_avg_qty to be similarly flushed (even though global_info is part of the original ancestry of selected_lines), because that would cause an unnecessary correlated reference. We can deal with global_avg_qty by simply relying on the new ancestry to provide it, which is why line 1852 checks whether it is part of the new ancestry.
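The flush decision described above can be reduced to a set difference (a simplified sketch of the idea, not the actual implementation): a down-streamed term is flushed into the partition child only if the new ancestry cannot already provide it, since flushing a term the new ancestry supplies would create an unnecessary correlated reference.

```python
# Terms available from the original ancestry of the partition child
# versus terms the NEW ancestry (TPCH -> PARTITION) will provide.
old_ancestry_terms = {"container", "global_avg_qty"}
new_ancestry_terms = {"global_avg_qty", "container_avg_qty"}

# Only flush what the new ancestry cannot supply on its own.
to_flush = old_ancestry_terms - new_ancestry_terms
print(to_flush)  # {'container'}
```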

@@ -474,23 +478,6 @@ def explain_unqualified(node: UnqualifiedNode, verbose: bool) -> str:
"\nThe collection does not have any terms that can be included in a result if it is executed."
)

# Identify the number of BACK levels that are accessible
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

@@ -80,13 +79,7 @@ def collection_in_context_string(
Returns:
The desired string representation of context and collection combined.
"""
if isinstance(collection, BackReferenceCollection):
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

@@ -69,8 +68,8 @@ def collection_in_context_string(
"""
Converts a collection in the context of another collection into a single
string in a way that elides back collection references. For example,
if the context is A.B.C.D, and the collection is BACK(2).E.F, the result
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

Comment on lines -114 to -116
# Build a back reference collection node
# Equivalent PyDough code: `BACK(1).subcollection`
back_reference_collection_node = builder.build_back_reference_collection(table_collection, "subcollection", 1)
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

Comment on lines +116 to +119
# Create a back-reference to the ancestor term.
return BackReferenceExpression(
self, term_name, self.ancestral_mapping[term_name]
)

This is where the magic trick happens: when you have Customers.CALCULATE(n=...).orders, .orders has all of its own columns as well as a back-reference to n. Any operation that steps down into a new layer (a collection access, partition-by, or partition child) has to do this trick.
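A runnable sketch of this lookup (names and return shapes are illustrative, not the real QDAG classes): a term found locally resolves to a plain reference, while a term found in the ancestral mapping resolves to a back-reference carrying the number of ancestor levels to climb.

```python
# Hypothetical state: n was defined 1 level up by Customers.CALCULATE(n=...)
ancestral_mapping = {"n": 1}

def resolve_term(term_name, local_terms):
    # Local terms win; otherwise an ancestral hit becomes an implicit
    # back-reference to the level recorded in the ancestral mapping.
    if term_name in local_terms:
        return ("Reference", term_name)
    if term_name in ancestral_mapping:
        return ("BackReferenceExpression", term_name, ancestral_mapping[term_name])
    raise KeyError(term_name)

print(resolve_term("order_key", {"order_key", "order_date"}))
print(resolve_term("n", {"order_key", "order_date"}))
```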

Comment on lines 127 to 131
pytest.param(
"answer = _ROOT.Parts",
"?.Parts",
"Parts",
id="access_collection",
),

I decided that, under the new semantics, it made more sense for the repr string of PyDough unqualified nodes to be valid PyDough code, so I removed the ?. prefixes.

@knassre-bodo

NOTE: more changes will need to be made once #269 is merged.

…ed to fix correlation/decorrelation handling of new BACK form as well as update any new references to BACK/CALC
@@ -473,21 +498,20 @@
├─┬─ AccessChild
│ ├─── TableCollection[Customers]
│ ├─── Calculate[cntry_code=SLICE(phone, None, 2, None)]
│ ├─┬─ Where[ISIN(cntry_code, ['13', '31', '23', '29', '30', '18', '17']) & HASNOT($1)]

Changes from the correlation/decorrelation PR

@@ -46,6 +47,30 @@
@pytest.mark.parametrize(
"impl, answer_tree_str",
[
pytest.param(
partition_as_child,

A useful edge case that came up in the stringification of PARTITION


Successfully merging this pull request may close these issues.

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax