-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set up PyDough collection nodes in AST module #20
Set up PyDough collection nodes in AST module #20
Conversation
The properties are evaluated lazily & cached to prevent ping-ponging | ||
between two tables that consider each other subcollections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't matter in this PR, but once the TODO item is addressed in the next PR, it will matter a lot.
@property | ||
def ancestor_context(self) -> Union["PyDoughCollectionAST", None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When doing accesses, the phrase BACK(1)
has access to the items from self.ancestor_context.all_terms
, and BACK(2)
has access to the items from self.ancestor_context.ancestor_context.all_terms
, and so on.
├─── Calc[a=[ancestor.name], b=[name], c=[MAX($2._expr1)], d=[COUNT($1)]] | ||
├─── Where[c > 1000] | ||
└─── OrderBy[d.DESC()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This form is definitely growing on me, since it becomes clearer that I'm following a series of steps, some of the steps move me down from a context to a child context, what ancestor
refers to is clearer here.
The Combine
(while still complicated) makes more sense since it is saying "ok we just stepped down into the context of nations
, we've filtered on the name not being "USA", now I want to combine this context with several more child contexts, then after ALL of them we're going to do the Calc.
ALTERNATIVE STRUCTURE | ||
``` | ||
┌─── TableCollection[Regions] | ||
├─── Where[ENDSWITH(name, 's')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A key point I brought up in our earlier meeting: when the terms in the child context (like likes 141, 150, 151, or 152) reference BACK(1)
they are referencing the combined pipeline of Regions
followed by the WHERE
(since that pipeline could include newly defined terms from calcs). In an implementation sense, this is achieved by having all of the children's ancestor pointer be pointing to this WHERE
, since it is the last thing in that pipeline. The WHERE
then has access to the rest of the pipeline of elements on the same level (in this case, just Regions
through predecessor pointers which give it access to all of the column properties of Regions
(or any newly defined properties, if there was a CALC
between Regions
and this WHERE
.
As far as the user is concerned regarding BACK
, the combo of line 138+139 is a single unit, as is the combo of lines 140, 141, 150, 151, and 152 (142 is a special case).
Similarly, 143/144 have line 141 as their ancestor that they refer to with BACK(1)
, and the line 138-139 combo as their ancestor for BACK(2)
.
This seems complicated in the tree structure to a user, but this isn't the main user API. When you look at what this looks like in the PyDough code, it becomes far more intuitive:
x1 = Regions.WHERE(ENDSWITH(name, 's'))
x2 = x1.nations.WHERE(name != 'USA')(
a=BACK(1).name,
b=name,
c=MAX(YEAR(suppliers.WHERE(STARTSWITH(phone, '415')).supply_records.lines.ship_date)),
d=COUNT(customers.WHERE(acctbal > 0))
).WHERE(
c > 1000
).ORDER_BY(
d.DESC()
)
In all of x2
(except for c/d which do more subcolleciton accesses), BACK(1)
refers to x1
as a cumulative entity. The place where the BACK(n)
meanings change is only with operations that cause us to make a true movement in the rich document model (like accessing .nations
).
[ | ||
pytest.param( | ||
TableCollectionInfo("Regions"), | ||
{"key": 0, "name": 1, "comment": 2}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calc terms have these orderings so we know exactly what order the columns would be in if we "printed" the result right then & there, and so we can ensure it is deterministic. This becomes a well-defined & relatively intuitive default behavior that can be easily augmented by different user-facing APIs.
"name", | ||
"key", | ||
"comment", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: subcollections will be accessible here, but not until the followup PR.
TableCollectionInfo("Regions") | ||
** CalcInfo(foo=LiteralInfo(42, Int64Type()), bar=ReferenceInfo("name")) | ||
), | ||
{"foo": 0, "bar": 1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still have access to name/key/comment, but they aren't part of the calc terms by default (since we didn't specify them in the most-recent-CALC). Note that name
is used, just under a different name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knassre-bodo See my comments. I see one API that has errors which is why I withheld my approval.
Each implementation the following additional features: | ||
- `successor`: the TestInfo for a collection node that is to be built using | ||
the collection node built by this builder as its parent/preceding context. | ||
- Can use the `**` operator to pipeline collection infos into one another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure this makes it into a README somewhere. Feel free to just open a github issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a README.md
for the test folder. I'll add this as a TODO item in that file (we already technically have a JIRA issue for this item).
""" | ||
The collections that the context has access to. | ||
""" | ||
return self._collections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This typing isn't consistent with the type hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is returning a dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this return the values? Even then the return type isn't correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this was a copy/pasting error.
self._calc_term_indices: Dict[str, Tuple[int, PyDoughExpressionAST]] | None = ( | ||
None | ||
) | ||
self._all_terms: Dict[str, PyDoughExpressionAST] | None = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When can this be None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, updating (this got scrambled a bit during the GlobalContext
change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @knassre-bodo
Initializing the AST nodes for collections, as opposed to expressions. Includes the following:
collections
module, which has anPyDoughCollectionAST
base class and implementorsTableCollection
andCalc
Reference
which refers to an expression (by name), of a collection contextReference
,TableCollection
andCalc
to the node builder & the node info classes.**
binary operation overloaded to be anand_then
operator that takes in a subsequent collection based test info so they can be pipelined into one another by the builder (has to be**
for the associativity to work out).Collections pertaining to a subcolleciton relationship, or other operators besides CALC, will be followup PRs.