-
Notifications
You must be signed in to change notification settings - Fork 3
Adding warning logs for potentially unsafe masking/unmasking calls #447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: kian/mask_server_rewrite
Are you sure you want to change the base?
Conversation
| "MASK": set(), | ||
| "UNMASK": {"CRBNK.ACCOUNTS.a_key"}, | ||
| }, | ||
| id="cryptbank_bubbleprop_01", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These "bubbleprop" tests all serve a specific purpose for the warning tracking: they all cover different kinds of situations where an expression is defined in one place that depends on unmask calls, gets propagated further along in the plan, then later gets used in a sensitive manner. Specifically, all of them are a window function where the inputs are sensitive columns (sometimes transformed by more functions), the output sometimes gets transformed further, and the window function eventually gets used:
- By nothing sensitive (so no warnings related to the window function)
- As part of filter condition
- As part of a join condition
- As part of an ordering key in a limit
- As part of an ordering key in the root
- As an aggregation key
| AGGREGATE(keys={'bucket': ROUND(cumavg / 10.0:numeric, 0:numeric) * 10:numeric}, aggregations={'n_rows': COUNT()}) | ||
| JOIN(condition=t0.t_sourceaccount == UNMASK::(CASE WHEN [t1.a_key] = 0 THEN 0 ELSE (CASE WHEN [t1.a_key] > 0 THEN 1 ELSE -1 END) * CAST(SUBSTRING([t1.a_key], 1 + INSTR([t1.a_key], '-'), LENGTH([t1.a_key]) / 2) AS INTEGER) END), type=INNER, cardinality=SINGULAR_FILTER, reverse_cardinality=PLURAL_FILTER, columns={'cumavg': t0.cumavg}) | ||
| PROJECT(columns={'cumavg': RELAVG(args=[SQRT(UNMASK::((1025.67 - ([t_amount]))))], partition=[a_branchkey], order=[(UNMASK::(DATETIME([t_ts], '+54321 seconds'))):asc_last], cumulative=True), 't_sourceaccount': t_sourceaccount}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can see a good instance of the propagation required to correctly identify some of the critical dependencies:
- Unmask calls to
t_amountandt_tsare used inside theRELAVGandSQRTcalls used to derivecumavg cumavgis bubbled upward through the join (which depends on unmaskinga_key)- The bubbled value of
cumavgis then divided by 10, rounded, multiplied by 10, then used as an aggregation key -> therefore the ability to unmaskt_amountandt_tsare both critical dependencies.
Therefore, we will have warning logs about the safety of unmasking t_ts, t_amount, and a_key.
However, in bubbleprop_01, we only have warning logs for a_key. Even though 01 still has the same window function & join, the value of cumavg never gets used in a sensitive manner, so if the user has bad permissions then the output will be garbage w/o necessarily being malformed (we can change this definition if we need to).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just add the docstring for extra_masking_warning_logs
| """ | ||
|
|
||
| self.critical_unmask_columns: set[str] = set() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| def extract_masking_warning_logs(log_str: str) -> dict[str, set[str]]: | ||
| """ | ||
| TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will let the final approval to @john-sanchez31 and @hadia206
| column is critical to the output of the query. | ||
| """ | ||
|
|
||
| self.expression_visitor = MaskingCriticalDetectionExpressionVisitor() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need type hint for this?
Adds a relational visitor pass at the end of the relational optimizer designed to traverse the final relational tree and identify any mask/unmask calls that would cause a critical logical error if the user does not have permission. The pass finds all mask/unmask calls that, at some point in the plan, are used in a sensitive manner so one of the following will happen if the user cannot make the mask/unmask plan:
Those three types of critical errors will happen if the mask/unmask call without sufficient permissions is eventually used in one of the following:
If the values are garbled in one of the following ways, it is not considered a critical failure unless the garbled value is used in one of the sensitive manners described above:
Since PyDough does not (currently) have a way to determine whether a user has permission to mask/unmask a column, the visitor will identify all sensitive dependencies that would cause a critical error if the user does not have permission. Then, it will dump all of these dependencies via logger warnings in the following format (per column):