Adding optimization rewrite pass to utilize server with information about masked columns #443
base: main
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,312 @@ | ||||
| """ | ||||
| Logic for the visitor that is run across all expressions to identify candidates | ||||
| for Mask Server rewrite conversion. | ||||
| """ | ||||
|
|
||||
| __all__ = ["MaskServerCandidateVisitor"] | ||||
|
|
||||
| import pydough.pydough_operators as pydop | ||||
| from pydough.relational import ( | ||||
| CallExpression, | ||||
| ColumnReference, | ||||
| CorrelatedReference, | ||||
| LiteralExpression, | ||||
| RelationalExpression, | ||||
| RelationalExpressionVisitor, | ||||
| WindowCallExpression, | ||||
| ) | ||||
| from pydough.types import UnknownType | ||||
|
|
||||
|
|
||||
| class MaskServerCandidateVisitor(RelationalExpressionVisitor): | ||||
| """ | ||||
| A relational expression visitor that identifies candidate expressions for | ||||
| Mask Server rewrite conversion, and stores them in a candidate pool for | ||||
| later processing by a `MaskServerRewriteShuttle`. The candidate pool | ||||
| contains expressions with the following criteria, including both | ||||
| atomic instances of the patterns, and larger expressions that contain | ||||
| these patterns as sub-expressions: | ||||
| 1. An expression that contains exactly one unique unmasking operator (i.e. a | ||||
| `MaskedExpressionFunctionOperator` with `is_unmask=True`). The contents | ||||
| of the unmasking operator can be any valid expression. | ||||
| 2. Literals are allowed anywhere in the expression. | ||||
| 3. No other expressions are allowed (outside the contents of the unmasking | ||||
| operator) except for function calls used to combine other valid | ||||
| expressions, where the function calls must be one of the operators | ||||
| supported by the Mask Server (see `OPERATORS_TO_SERVER_NAMES`, as well as | ||||
| the `ISIN` operator). | ||||
| """ | ||||
|
|
||||
| OPERATORS_TO_SERVER_NAMES: dict[pydop.PyDoughExpressionOperator, str] = { | ||||
| pydop.BAN: "AND", | ||||
| pydop.BOR: "OR", | ||||
| pydop.NOT: "NOT", | ||||
| pydop.EQU: "EQUAL", | ||||
| pydop.NEQ: "NOT_EQUAL", | ||||
| pydop.GRT: "GT", | ||||
| pydop.GEQ: "GTE", | ||||
| pydop.LET: "LT", | ||||
| pydop.LEQ: "LTE", | ||||
| pydop.STARTSWITH: "STARTSWITH", | ||||
| pydop.ENDSWITH: "ENDSWITH", | ||||
| pydop.LOWER: "LOWER", | ||||
| pydop.UPPER: "UPPER", | ||||
| pydop.MONOTONIC: "BETWEEN", | ||||
| pydop.YEAR: "YEAR", | ||||
| pydop.MONTH: "MONTH", | ||||
| pydop.DAY: "DAY", | ||||
| pydop.ADD: "ADD", | ||||
| pydop.SUB: "SUB", | ||||
| pydop.MUL: "MUL", | ||||
| pydop.DIV: "DIV", | ||||
| } | ||||
| """ | ||||
| A mapping of all PyDough operators that can be handled by the Mask Server, | ||||
| mapping each such operator to the string name used in the linear string | ||||
| serialization format recognized by the Mask Server. | ||||
| Note: ISIN is handled separately. | ||||
| """ | ||||
|
|
||||
| def __init__(self) -> None: | ||||
| self.candidate_pool: dict[ | ||||
| RelationalExpression, | ||||
| tuple[ | ||||
| pydop.MaskedExpressionFunctionOperator, | ||||
| RelationalExpression, | ||||
| list[str | int | float | None | bool], | ||||
| ], | ||||
| ] = {} | ||||
| """ | ||||
| The internal data structure used to keep track of all candidate | ||||
| expressions identified during a traversal of a relational tree. Each | ||||
| candidate expression maps to a tuple of: | ||||
| 1. The single unmasking operator contained within the expression. | ||||
| 2. The input expression that is being unmasked. | ||||
| 3. The linear serialization of the entire expression as a list, where | ||||
| invocations of UNMASK(input_expr) are replaced with the token | ||||
| "__col__". | ||||
| """ | ||||
|
|
||||
| self.processed_candidates: set[RelationalExpression] = set() | ||||
| """ | ||||
| The set of all relational expressions that have already been added to | ||||
| the candidate pool at least once. This is used to avoid adding the same | ||||
| candidate multiple times if it is encountered multiple times during a | ||||
| traversal of the relational tree, since the candidate pool will be | ||||
| cleared once all of the candidates in the pool are processed in a batch | ||||
| request to the mask server. | ||||
| """ | ||||
|
|
||||
| self.stack: list[ | ||||
| tuple[ | ||||
| tuple[pydop.MaskedExpressionFunctionOperator, RelationalExpression] | ||||
| | None, | ||||
| list[str | int | float | None | bool] | None, | ||||
| ] | ||||
| ] = [] | ||||
| """ | ||||
| The stack is used to keep track of information relating to | ||||
| sub-expressions of the current expression. When visiting an expression, | ||||
| the stack will contain one entry for each input to the expression, | ||||
| where each entry is a tuple of: | ||||
| 1. Either None, or the single unmasking operator and input expression | ||||
| contained within the input expression, if any. | ||||
| 2. Either None, or the linear serialization of the input expression as | ||||
| a list, where invocations of UNMASK(input_expr) are replaced with | ||||
| the token "__col__". | ||||
| """ | ||||
|
|
||||
| def reset(self): | ||||
| self.stack.clear() | ||||
|
|
||||
| def visit_call_expression(self, expr: CallExpression) -> None: | ||||
| # TODO: ADD COMMENTS | ||||
|
||||
| # TODO: ADD COMMENTS |
knassre-bodo marked this conversation as resolved.
I'm confused regarding the operators. Are we using the mask_server operator (IN and NOT_IN) or ISIN? I see sometimes IN and other times ISIN
This is how we convert PyDough relational terms to the mask server terms. Specifically, we are converting a call of ISIN to the mask server operator IN. If the PyDough code is NOT(ISIN(...)), then the list returned by this function will get wrapped in a NOT call.
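For illustration only, here is a minimal sketch of that conversion. The helper names (`serialize_isin`, `wrap_not`) and the exact token layout (operator name, then an argument count, then the arguments) are assumptions made for this example and are not taken from the PR; only the `__col__` placeholder and the `IN` / `NOT` operator names come from the code and discussion above.

```python
from typing import Union

# One token of the linear serialization (mirrors the element type in the diff).
Token = Union[str, int, float, bool, None]


def serialize_isin(column_tokens: list[Token], values: list[Token]) -> list[Token]:
    """Serialize ISIN(column, values) as a prefix-form IN call.

    ASSUMPTION: the layout [name, argument count, *arguments] is hypothetical.
    """
    return ["IN", len(values) + 1, *column_tokens, *values]


def wrap_not(tokens: list[Token]) -> list[Token]:
    """Wrap an already serialized expression in a NOT call (same assumed layout)."""
    return ["NOT", 1, *tokens]


# ISIN(UNMASK(col), ["a", "b"]) with the unmasked column replaced by "__col__".
inner = serialize_isin(["__col__"], ["a", "b"])

# NOT(ISIN(...)) reuses the IN serialization and wraps it, so no NOT_IN is needed.
negated = wrap_not(inner)
```

Under these assumptions, the negated form is built from the `IN` list rather than from a separate `NOT_IN` operator, which matches the behaviour described in the reply above.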
Are these the operators used in the mock server? If so, we should add `IN` and `NOT_IN` (can be found in the lookup table).
These are the operators used in the real server (which the mock server should emulate). And the point isn't to include all of their operators (e.g. we don't do regex), it's to include all of the mappings from our operators to theirs. `ISIN` is handled separately from this mapping, and we don't currently use `NOT_ISIN` at all; we just do `ISIN` and sometimes wrap the result in a `NOT` call. There is no operator in PyDough which maps to `NOT_ISIN`.
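As a rough illustration of the candidate criteria listed in the `MaskServerCandidateVisitor` docstring (exactly one unique unmasking operator, literals anywhere, and only supported function calls), the following toy sketch checks those rules on a simplified expression tree. The `Lit`, `Unmask`, and `Call` classes, the string operator names, and the `is_candidate` helper are all hypothetical stand-ins invented for this example; they are not PyDough's relational classes and not the visitor's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lit:
    """Hypothetical stand-in for a literal expression."""
    value: object


@dataclass(frozen=True)
class Unmask:
    """Hypothetical stand-in for UNMASK(column); identified by column name."""
    column: str


@dataclass(frozen=True)
class Call:
    """Hypothetical stand-in for a function call expression."""
    op: str
    args: tuple


# A subset of the supported operators, plus ISIN, which is handled separately.
SUPPORTED_OPS = {"BAN", "BOR", "NOT", "EQU", "NEQ", "ISIN"}


def unmasked_columns(expr: object) -> Optional[set]:
    """Return the unmasked columns in expr, or None if expr is unsupported."""
    if isinstance(expr, Lit):
        return set()
    if isinstance(expr, Unmask):
        return {expr.column}
    if isinstance(expr, Call) and expr.op in SUPPORTED_OPS:
        found: set = set()
        for arg in expr.args:
            sub = unmasked_columns(arg)
            if sub is None:
                return None
            found |= sub
        return found
    return None


def is_candidate(expr: object) -> bool:
    """True when expr is supported and contains exactly one unique unmask."""
    columns = unmasked_columns(expr)
    return columns is not None and len(columns) == 1
```

With these definitions, `is_candidate(Call("NOT", (Call("ISIN", (Unmask("a"), Lit("x"))),)))` holds, while an expression that compares two different unmasked columns, or that uses an operator outside the supported set, is rejected.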