
Set up PyDough collection nodes in AST module #20

Merged
71 commits merged into kian/ast_expr_collection_part_1 on Nov 4, 2024

Conversation

@knassre-bodo (Contributor) commented Oct 29, 2024:

Initializing the AST nodes for collections, as opposed to expressions. Includes the following:

  • The collections module, which has a PyDoughCollectionAST base class and the implementors TableCollection and Calc.
  • Each collection can have a predecessor and an ancestor (either may be absent) to help resolve various calc/back hierarchies.
  • Added a new expression type Reference, which refers to an expression (by name) of a collection context.
  • Added Reference, TableCollection, and Calc to the node builder & the node info classes.
  • Collection-based test info objects overload the ** binary operator as an and_then operator that takes a subsequent collection-based test info, so they can be pipelined into one another by the builder (it has to be ** for the associativity to work out; see the sketch below).

Collections pertaining to a subcollection relationship, or other operators besides CALC, will come in follow-up PRs.
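
As a rough illustration of the ** pipelining mentioned in the last bullet, here is a minimal sketch (hypothetical class and method names, not the actual test utilities) of how the overload can chain collection infos:

```
# Minimal sketch, not the real PyDough test classes: hypothetical names.
class CollectionTestInfo:
    def __init__(self):
        self.successor = None  # the next info to build on top of this one

    def build(self, builder, context=None):
        """Build this node's AST node, given the previously built node as context."""
        raise NotImplementedError

    def __pow__(self, other: "CollectionTestInfo") -> "CollectionTestInfo":
        # `a ** b ** c` parses as `a ** (b ** c)` because ** is right-associative,
        # so `b` links to `c` first and `a` then links to the b -> c chain,
        # producing the pipeline a -> b -> c.
        self.successor = other
        return self

    def build_pipeline(self, builder, context=None):
        # Build this node, then let the successor (if any) build on top of it.
        node = self.build(builder, context)
        return self.successor.build_pipeline(builder, node) if self.successor else node
```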

Comment on lines 46 to 47
The properties are evaluated lazily & cached to prevent ping-ponging
between two tables that consider each other subcollections.
@knassre-bodo (Contributor Author), Oct 29, 2024:

Doesn't matter in this PR, but once the TODO item is addressed in the next PR, it will matter a lot.
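
For reference, a minimal sketch of the lazy-and-cached pattern the quoted docstring describes (hypothetical names, not the actual PyDough metadata classes), so two collections that list each other as subcollections don't recurse forever:

```
from functools import cached_property

class CollectionMeta:
    """Hypothetical stand-in for a PyDough collection's metadata."""

    registry: dict[str, "CollectionMeta"] = {}

    def __init__(self, name: str, subcollection_names: list[str]):
        self.name = name
        self._subcollection_names = subcollection_names
        CollectionMeta.registry[name] = self

    @cached_property
    def subcollections(self) -> dict[str, "CollectionMeta"]:
        # Resolved lazily on first access and cached, so two collections that
        # name each other as subcollections never recurse at construction time.
        return {n: CollectionMeta.registry[n] for n in self._subcollection_names}

# Regions and Nations can each treat the other as a subcollection without
# either one needing the other to be fully resolved up front.
regions = CollectionMeta("Regions", ["Nations"])
nations = CollectionMeta("Nations", ["Regions"])
assert nations.subcollections["Regions"] is regions
```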

Comment on lines +23 to +24
@property
def ancestor_context(self) -> Union["PyDoughCollectionAST", None]:
@knassre-bodo (Contributor Author), Oct 29, 2024:

When doing accesses, the phrase BACK(1) has access to the items from self.ancestor_context.all_terms, and BACK(2) has access to the items from self.ancestor_context.ancestor_context.all_terms, and so on.
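
A minimal sketch (hypothetical helper, not the actual PyDough API; all_terms is assumed here to be a name-to-expression mapping) of resolving a BACK(n) reference by walking n ancestor_context links:

```
def resolve_back(context, levels, term_name):
    # Walk `levels` ancestor_context links, then look the term up there.
    target = context
    for _ in range(levels):
        if target.ancestor_context is None:
            raise ValueError(f"BACK({levels}) goes above the top-level context")
        target = target.ancestor_context
    return target.all_terms[term_name]
```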

Comment on lines 150 to 152
├─── Calc[a=[ancestor.name], b=[name], c=[MAX($2._expr1)], d=[COUNT($1)]]
├─── Where[c > 1000]
└─── OrderBy[d.DESC()]
@knassre-bodo (Contributor Author):

This form is definitely growing on me: it becomes clearer that I'm following a series of steps, some of which move me down from a context to a child context, and what ancestor refers to is more obvious here.
The Combine (while still complicated) also makes more sense, since it is saying: "OK, we just stepped down into the context of nations and filtered on the name not being 'USA'; now I want to combine this context with several more child contexts, and after ALL of them we're going to do the Calc."

@knassre-bodo requested a review from njriasan on October 29, 2024 at 20:57
ALTERNATIVE STRUCTURE
```
┌─── TableCollection[Regions]
├─── Where[ENDSWITH(name, 's')]
```
@knassre-bodo (Contributor Author), Oct 30, 2024:

A key point I brought up in our earlier meeting: when the terms in the child context (like lines 141, 150, 151, or 152) reference BACK(1), they are referencing the combined pipeline of Regions followed by the WHERE (since that pipeline could include newly defined terms from calcs). In an implementation sense, this is achieved by having all of the children's ancestor pointers point to this WHERE, since it is the last thing in that pipeline. The WHERE then has access to the rest of the pipeline of elements on the same level (in this case, just Regions) through predecessor pointers, which give it access to all of the column properties of Regions (or any newly defined properties, if there were a CALC between Regions and this WHERE).

As far as the user is concerned regarding BACK, the combo of lines 138+139 is a single unit, as is the combo of lines 140, 141, 150, 151, and 152 (142 is a special case).

Similarly, lines 143/144 have line 141 as the ancestor they refer to with BACK(1), and the line 138-139 combo as their ancestor for BACK(2).

This seems complicated to a user looking at the tree structure, but the tree isn't the main user API. When you look at the equivalent PyDough code, it becomes far more intuitive:

x1 = Regions.WHERE(ENDSWITH(name, 's'))
x2 = x1.nations.WHERE(name != 'USA')(
            a=BACK(1).name,
            b=name,
            c=MAX(YEAR(suppliers.WHERE(STARTSWITH(phone, '415')).supply_records.lines.ship_date)),
            d=COUNT(customers.WHERE(acctbal > 0))
        ).WHERE(
            c > 1000
        ).ORDER_BY(
            d.DESC()
        )

In all of x2 (except for c/d, which do more subcollection accesses), BACK(1) refers to x1 as a cumulative entity. The BACK(n) meanings only change with operations that cause a true movement in the rich document model (like accessing .nations).

[
pytest.param(
TableCollectionInfo("Regions"),
{"key": 0, "name": 1, "comment": 2},
@knassre-bodo (Contributor Author), Oct 30, 2024:

The calc terms have these orderings so we know exactly what order the columns would be in if we "printed" the result right then & there, and so we can ensure it is deterministic. This becomes a well-defined & relatively intuitive default behavior that can be easily augmented by different user-facing APIs.
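
For illustration, a small hypothetical snippet mirroring the expected-ordering dict quoted above; the mapping pins down the default column order deterministically:

```
# The expected-output dict maps each calc term to its column position, so the
# columns have a well-defined order if the result is "printed" at this point.
calc_term_indices = {"key": 0, "name": 1, "comment": 2}
ordered_columns = sorted(calc_term_indices, key=calc_term_indices.get)
assert ordered_columns == ["key", "name", "comment"]
```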

Comment on lines +27 to +29
"name",
"key",
"comment",
@knassre-bodo (Contributor Author):

Note: subcollections will be accessible here, but not until the followup PR.

TableCollectionInfo("Regions")
** CalcInfo(foo=LiteralInfo(42, Int64Type()), bar=ReferenceInfo("name"))
),
{"foo": 0, "bar": 1},
@knassre-bodo (Contributor Author):

We still have access to name/key/comment, but they aren't part of the calc terms by default (since we didn't specify them in the most-recent-CALC). Note that name is used, just under a different name.
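
A small hypothetical illustration (attribute names assumed, matching the expected output above) of the distinction between the most-recent-CALC terms and everything the context can still reference:

```
# The latest CALC defines the default output columns...
calc_term_indices = {"foo": 0, "bar": 1}
# ...while the context still has access to the underlying table's properties.
all_terms = {"foo", "bar", "key", "name", "comment"}
assert set(calc_term_indices) <= all_terms  # calc terms are a subset of all terms
```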

@njriasan (Contributor) left a comment:

@knassre-bodo See my comments. I see one API that has errors, which is why I withheld my approval.

Each implementation the following additional features:
- `successor`: the TestInfo for a collection node that is to be built using
the collection node built by this builder as its parent/preceding context.
- Can use the `**` operator to pipeline collection infos into one another.
@njriasan (Contributor):

Let's make sure this makes it into a README somewhere. Feel free to just open a GitHub issue.

@knassre-bodo (Contributor Author), Nov 4, 2024:

We have a README.md for the test folder. I'll add this as a TODO item in that file (we already technically have a JIRA issue for this item).

"""
The collections that the context has access to.
"""
return self._collections
@njriasan (Contributor):

This typing isn't consistent with the type hints.

@njriasan (Contributor):

This is returning a dictionary.

@njriasan (Contributor):

Should this return the values? Even then the return type isn't correct.

@knassre-bodo (Contributor Author):

Good catch, this was a copy/paste error.

self._calc_term_indices: Dict[str, Tuple[int, PyDoughExpressionAST]] | None = (
None
)
self._all_terms: Dict[str, PyDoughExpressionAST] | None = {}
@njriasan (Contributor):

When can this be None?

@knassre-bodo (Contributor Author):

Whoops, updating (this got scrambled a bit during the GlobalContext change)

@knassre-bodo requested a review from njriasan on November 4, 2024 at 16:57
@njriasan (Contributor) left a comment:

LGTM! Thanks @knassre-bodo

Base automatically changed from kian/setup_expression_nodes to kian/ast_expr_collection_part_1 November 4, 2024 21:20
@knassre-bodo merged commit 644e392 into kian/ast_expr_collection_part_1 on Nov 4, 2024
2 checks passed
@knassre-bodo deleted the kian/init_collections branch on November 4, 2024 at 21:27
knassre-bodo added a commit that referenced this pull request Nov 5, 2024
Combination of the following PRs:
- #17
- #20
- #19
- #21
- #22
- #24