
Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method #256

Open · wants to merge 25 commits into base: main
Conversation


@knassre-bodo knassre-bodo commented Feb 11, 2025

Resolves #255; see the issue for more details about the goals of this drastic overhaul. The propagation of aliases from parent to child, implicitly creating BACK references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with the new semantics. The collection equivalents of back-references were deleted, as they are no longer needed and were never fully supported to begin with.
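As a rough illustration of the new semantics (the collection and field names here are hypothetical, not from this repository), the old explicit BACK syntax versus the new down-streamed aliases look roughly like this:

```
# Before: CALC via call syntax, with an explicit back-reference
result = People(n=COUNT(children)).children(parent_count=BACK(1).n)

# After: CALCULATE down-streams the alias n into child contexts,
# so the child can reference it directly as an implicit back-reference
result = People.CALCULATE(n=COUNT(children)).children.CALCULATE(parent_count=n)
```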

from collections.abc import Callable

import pytest
from test_utils import (
graph_fetcher,
)
from tpch_test_functions import (
@knassre-bodo knassre-bodo Feb 12, 2025

Switched to using these instead of having the duplicates lying around, since I was getting tired of updating the duplicates.

@knassre-bodo knassre-bodo changed the title Overhaul BACK and CALC to use downstreaming of aliases Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method Feb 13, 2025
Comment on lines -20 to -23
lps_back_lines_impl,
lps_back_lines_price_impl,
lps_back_supplier_impl,
lps_back_supplier_name_impl,

These examples are no longer valid.

@knassre-bodo knassre-bodo marked this pull request as ready for review February 13, 2025 08:25
result = People(
result = People.CALCULATE(
@knassre-bodo knassre-bodo Feb 13, 2025

We need to ensure all of these got updated, in all of our documentation, notebooks, tests, etc.

@@ -101,7 +101,7 @@
"source": [
@knassre-bodo knassre-bodo Feb 13, 2025

At least 1 reviewer should re-run all 5 notebooks to confirm they behave as expected.

" O_ORDERPRIORITY=order_priority,\n",
" ORDER_COUNT=COUNT(o),\n",
").ORDER_BY(O_ORDERPRIORITY.ASC())\n",
"pydough.to_df(output)"
]
},
{
"cell_type": "markdown",

Also added 5/21/22, since by the time this is merged, correlation support will have already been added.

@@ -875,21 +875,23 @@ def rel_translation(
@staticmethod
def preprocess_root(
node: PyDoughCollectionQDAG,
output_cols: list[tuple[str, str]] | None,

Uses output_cols as an override of the normal behavior.
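A minimal sketch of the override behavior described above (the function name and the exact shape of the tuples are assumptions for illustration, not the real `preprocess_root` signature): when `output_cols` is `None`, the default ordering is used; otherwise it replaces the output with explicit (alias, column) pairs.

```python
def select_output_columns(ordered_columns, output_cols=None):
    # Hypothetical sketch: output_cols, when provided, overrides the
    # normal column ordering/naming with explicit (alias, name) pairs;
    # None falls through to the default behavior.
    if output_cols is None:
        return ordered_columns
    available = dict(ordered_columns)
    return [(alias, available[name]) for alias, name in output_cols]

default = [("key", "t0.key"), ("name", "t0.name")]
print(select_output_columns(default))                  # default ordering kept
print(select_output_columns(default, [("k", "key")]))  # [('k', 't0.key')]
```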

rel_expr = output.expressions[hybrid_expr]
ordered_columns.append((original_name, rel_expr))
ordered_columns.sort(key=lambda col: node.get_expression_position(col[0]))
if columns is None:

Same here.

@@ -66,6 +66,47 @@ def _load_session_info(
return metadata, config, database, bindings


def _load_column_selection(kwargs: dict[str, object]) -> list[tuple[str, str]] | None:

Some of the testing changes were specifically to make sure this kwarg works as expected (most tests just pass in None).
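A hedged sketch of what such a kwarg-normalizing helper might do (the real `_load_column_selection` may accept different input shapes; the dict/list handling here is an assumption): normalize an optional `columns` kwarg into `(output_name, column_name)` pairs, with `None` meaning "no override".

```python
def load_column_selection(kwargs):
    # Hypothetical sketch of the real _load_column_selection helper:
    # pull an optional "columns" kwarg and normalize it to a list of
    # (output_name, column_name) pairs; None means "use default columns".
    columns = kwargs.pop("columns", None)
    if columns is None:
        return None
    if isinstance(columns, dict):
        # {alias: column_name} -> [(alias, column_name)]
        return list(columns.items())
    if isinstance(columns, (list, tuple)):
        # plain names keep their own name as the output alias
        return [(name, name) for name in columns]
    raise TypeError(f"unsupported column selection: {columns!r}")

print(load_column_selection({"columns": ["a", "b"]}))   # [('a', 'a'), ('b', 'b')]
print(load_column_selection({"columns": {"x": "col"}})) # [('x', 'col')]
print(load_column_selection({}))                        # None
```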

Comment on lines 426 to +430
for name in collection.calc_terms:
# Skip columns that are overloaded with a name from an ancestor,
# since they should not be used.
if name in collection.ancestral_mapping:
continue

See `case Reference():` further down for where this comes into play. Skipping them now prevents unnecessarily including an ancestral term until it is specifically requested (otherwise, we would get tons of unnecessary correlations).
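The filtering in the diff above boils down to the following sketch (the term names are hypothetical examples): a calc term whose name is shadowed by an ancestral (down-streamed) term is skipped, so the ancestral value is only materialized when explicitly referenced.

```python
# Hypothetical example data: "container" is down-streamed from 1 level up,
# so it shadows the local calc term of the same name and must be skipped.
calc_terms = ["part_key", "container", "retail_price"]
ancestral_mapping = {"container": 1}

output_terms = [name for name in calc_terms if name not in ancestral_mapping]
print(output_terms)  # ['part_key', 'retail_price']
```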

hybrid.add_successor(successor_hybrid)
self.populate_children(successor_hybrid, node, child_ref_mapping)
partition_child_idx: int = child_ref_mapping[0]
back_exprs: dict[str, HybridExpr] = {}

The trick here with PARTITION_BY is that all of the ancestral terms that would be lost by the change in ancestry need to be flushed downward into the current child context, so that all of them remain accessible later if needed. This does not cause correlations, since it only propagates down-streamed terms along the single vertical spine from the child node to the Partition.

E.g. consider the following snippet:

selected_lines = parts.CALCULATE(
  container
).supply_records.WHERE(
  supplier.nation.region.name == "EUROPE"
).lines
global_info = TPCH(global_avg_qty=AVG(selected_lines.quantity))
result = global_info.PARTITION(
  selected_lines, name="lines", by=container
).CALCULATE(
  container_avg_qty
).lines.WHERE(
  (quantity > global_avg_qty) & 
  (quantity > container_avg_qty)
)

The container term, made available for down-streaming, is never used by selected_lines. However, when it is partitioned, we first flush it downward into the terms of selected_lines so it can be used as an implicit back-reference to the original ancestry. Then, when the ancestry is changed, lines gains access to global_avg_qty and container_avg_qty from its new ancestry. However, we do not want global_avg_qty to be similarly flushed (even though global_info is part of the original ancestry of selected_lines), because that would cause an unnecessary correlated reference. We can deal with global_avg_qty by simply relying on the new ancestry to provide it, which is why line 1852 checks whether it is part of the new ancestry.
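The flush decision described above can be reduced to a set difference (a simplified sketch of the idea, not the actual implementation): a down-streamed term is flushed into the partition child only if the new ancestry cannot already provide it, since flushing a term the new ancestry supplies would create an unnecessary correlated reference.

```python
# Terms available from the original ancestry of the partition child
# versus terms the NEW ancestry (TPCH -> PARTITION) will provide.
old_ancestry_terms = {"container", "global_avg_qty"}
new_ancestry_terms = {"global_avg_qty", "container_avg_qty"}

# Only flush what the new ancestry cannot supply on its own.
to_flush = old_ancestry_terms - new_ancestry_terms
print(to_flush)  # {'container'}
```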

@@ -474,23 +478,6 @@ def explain_unqualified(node: UnqualifiedNode, verbose: bool) -> str:
"\nThe collection does not have any terms that can be included in a result if it is executed."
)

# Identify the number of BACK levels that are accessible
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

@@ -80,13 +79,7 @@ def collection_in_context_string(
Returns:
The desired string representation of context and collection combined.
"""
if isinstance(collection, BackReferenceCollection):
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

@@ -69,8 +68,8 @@ def collection_in_context_string(
"""
Converts a collection in the context of another collection into a single
string in a way that elides back collection references. For example,
if the context is A.B.C.D, and the collection is BACK(2).E.F, the result
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

Comment on lines -114 to -116
# Build a back reference collection node
# Equivalent PyDough code: `BACK(1).subcollection`
back_reference_collection_node = builder.build_back_reference_collection(table_collection, "subcollection", 1)
@knassre-bodo knassre-bodo Feb 13, 2025

This behavior no longer exists (and wasn't supported E2E anyway)

Comment on lines +116 to +119
# Create a back-reference to the ancestor term.
return BackReferenceExpression(
self, term_name, self.ancestral_mapping[term_name]
)

This is where the magic trick happens: when you have Customers.CALCULATE(n=...).orders, .orders has all of its own columns as well as a back-reference to n. Any operation that steps down into a new layer (a collection access, partition-by, or partition child) has to do this trick.
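A runnable sketch of this lookup (names and return shapes are illustrative, not the real QDAG classes): a term found locally resolves to a plain reference, while a term found in the ancestral mapping resolves to a back-reference carrying the number of ancestor levels to climb.

```python
# Hypothetical state: n was defined 1 level up by Customers.CALCULATE(n=...)
ancestral_mapping = {"n": 1}

def resolve_term(term_name, local_terms):
    # Local terms win; otherwise an ancestral hit becomes an implicit
    # back-reference to the level recorded in the ancestral mapping.
    if term_name in local_terms:
        return ("Reference", term_name)
    if term_name in ancestral_mapping:
        return ("BackReferenceExpression", term_name, ancestral_mapping[term_name])
    raise KeyError(term_name)

print(resolve_term("order_key", {"order_key", "order_date"}))
print(resolve_term("n", {"order_key", "order_date"}))
```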

Comment on lines 127 to 131
pytest.param(
"answer = _ROOT.Parts",
"?.Parts",
"Parts",
id="access_collection",
),

I decided that, under the new semantics, it made more sense for the repr string of PyDough unqualified nodes to be valid PyDough code, so I removed the ?. prefixes.

@knassre-bodo

NOTE: more changes will need to be made once #269 is merged.

…ed to fix correlation/decorrelation handling of new BACK form as well as update any new references to BACK/CALC
@@ -473,21 +498,20 @@
├─┬─ AccessChild
│ ├─── TableCollection[Customers]
│ ├─── Calculate[cntry_code=SLICE(phone, None, 2, None)]
│ ├─┬─ Where[ISIN(cntry_code, ['13', '31', '23', '29', '30', '18', '17']) & HASNOT($1)]

Changes from the correlation/decorrelation PR

@@ -46,6 +47,30 @@
@pytest.mark.parametrize(
"impl, answer_tree_str",
[
pytest.param(
partition_as_child,

A useful edge case that came up in the stringification of PARTITION


Successfully merging this pull request may close these issues.

Overhaul the way back-references and CALC work to eliminate the need for BACK syntax