Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BSE-4155] Add support for the Join relational node #42

Merged
merged 134 commits into from
Nov 14, 2024
Merged

Conversation

njriasan
Copy link
Contributor

Adds support for the basic definition of the Join relational node.

super().__init__(data_type)
self._name: str = name
self._input_name: str | None = input_name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should allow join to generate input specific columns in the conversion using "left" or "right".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are not doing full qualification (just "does it come from the LHS or RHS of the join) does that mean we must insert naming/pruning projections after every join so that the left/right sides are unified afterwards? Otherwise, if there are subsequent joins, how will we know what "left" refers to, if "left" also has its own "left" and "right"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No these are purely ephemeral for translating your input and wouldn't be represented in the output. My thought was that we just need to be able to resolve relative to the node and then lowering could optionally translate this into the SQL.

So for example, imagine we opted to have join always use left and right. Then we could execute the following steps:

  1. Assign the inputs the join the aliases left and right.
  2. Generate expressions using these aliases.
  3. Generate the output which won't depend on any table alias.

Alternatively the lowering could only generate aliases when its necessary or rather than doing left/right every time could generate unique aliases to map left/right to, possibly via a counter. The exact procedure isn't important so long as we can tell the origin for every column where it may be ambiguous.

Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, but overall LGTM.

Comment on lines 17 to 21
class JoinType(Enum):
INNER = "inner"
LEFT = "left"
RIGHT = "right"
FULL_OUTER = "full outer"
Copy link
Contributor

@knassre-bodo knassre-bodo Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include semi and anti, bc PyDough will have convenience syntax for HAS and HASNOT, and we can just fast-track them to that path + deal with the specifics of the translation in the SqlGlot step.

SQLGlot does have join kinds for semi/anti, and does the EXIST translation step for us if the dialect doesn't allow them (https://github.com/tobymao/sqlglot/blob/79f67830d7d3ba92bff91eeb95b4dc8bdfa6c44e/sqlglot/dialects/sqlite.py#L199).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of what I see this looking like:

  • Suppose we have S.WHERE(HAS(T) & p1 & p2). That becomes JOIN(semi, S, T).WHERE(p1 & p2)
  • Suppose we have S.WHERE(HASNOT(T) & p1 & p2). That becomes JOIN(anti, S, T).WHERE(p1 & p2)
  • If HAS is used in any other patterns besides in the conjunction of a WHERE, it is expanded to COUNT(T) > 0, and HASNOT into COUNT(T) == 0, then translated accordingly.

We could even make the expansion happen in the AST phase, so all HAS/HASNOT present during AST->Relational translation must be semi/anti joins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add this when we have the syntax.

super().__init__(data_type)
self._name: str = name
self._input_name: str | None = input_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are not doing full qualification (just "does it come from the LHS or RHS of the join) does that mean we must insert naming/pruning projections after every join so that the left/right sides are unified afterwards? Otherwise, if there are subsequent joins, how will we know what "left" refers to, if "left" also has its own "left" and "right"

Base automatically changed from nick/root to main November 14, 2024 20:16
@njriasan njriasan merged commit 18617d9 into main Nov 14, 2024
4 checks passed
@njriasan njriasan deleted the nick/join branch November 14, 2024 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants