@knassre-bodo knassre-bodo commented Oct 27, 2025

Adds a relational visitor pass at the end of the relational optimizer that traverses the final relational tree and identifies any mask/unmask calls that would cause a critical logical error if the user does not have permission. The pass finds all mask/unmask calls that are used in a sensitive manner at some point in the plan, meaning that one of the following will happen if the user does not have permission to make the mask/unmask call:

  • The answer will have the wrong number of rows
  • The answer will include the wrong subset of rows
  • The answer will have the rows in the wrong order

These three kinds of critical error occur when a mask/unmask call made without sufficient permissions is eventually used in one of the following:

  • The condition of a join
  • The condition of a filter
  • An ordering key for a limit
  • An ordering key for the relational root (ordering the final table)

The following uses of a garbled mask/unmask value are not considered critical failures on their own, unless the garbled value is also used in one of the sensitive manners described above:

  • Calling a function/aggregation/window function on mask/unmask calls
  • Using mask/unmask on the data presented as the final answer
  • Using mask/unmask calls as the partition or order keys for a window function

Since PyDough does not (currently) have a way to determine whether a user has permission to mask/unmask a column, the visitor will identify all sensitive dependencies that would cause a critical error if the user does not have permission. Then, it will dump all of these dependencies via logger warnings in the following format (per column):

WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:148 Query will not produce a valid output unless user has permission to mask column `CRBNK.CUSTOMERS.c_fname`
WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:152 Query will not produce a valid output unless user has permission to unmask column `CRBNK.CUSTOMERS.c_key`
WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:152 Query will not produce a valid output unless user has permission to unmask column `CRBNK.ACCOUNTS.a_key`
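
The warning format above is regular enough to be parsed back into per-operation sets of columns, which is the shape the tests in this PR compare against (`{"MASK": ..., "UNMASK": ...}`). Below is a minimal sketch of such a parser. The function name `parse_masking_warnings` and the regular expression are assumptions made for illustration; this is not the PR's `extract_masking_warning_logs` implementation.

```python
import re

# Hypothetical pattern for the warning text shown above; the real test
# helper in the PR may extract the columns differently.
_WARNING_PATTERN = re.compile(
    r"unless user has permission to (mask|unmask) column `([^`]+)`"
)


def parse_masking_warnings(log_str: str) -> dict[str, set[str]]:
    """Collect the columns named in mask/unmask warnings, keyed by operation."""
    result: dict[str, set[str]] = {"MASK": set(), "UNMASK": set()}
    for operation, column in _WARNING_PATTERN.findall(log_str):
        result[operation.upper()].add(column)
    return result


# The three warning lines quoted in the PR description.
sample_log = "\n".join(
    [
        "WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:148 "
        "Query will not produce a valid output unless user has permission to "
        "mask column `CRBNK.CUSTOMERS.c_fname`",
        "WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:152 "
        "Query will not produce a valid output unless user has permission to "
        "unmask column `CRBNK.CUSTOMERS.c_key`",
        "WARNING  pydough.logger.logger:masking_critical_detection_visitor.py:152 "
        "Query will not produce a valid output unless user has permission to "
        "unmask column `CRBNK.ACCOUNTS.a_key`",
    ]
)

warnings_by_operation = parse_masking_warnings(sample_log)
```

Returning sets rather than lists means a column that is warned about more than once only appears once, which keeps test expectations independent of how often the visitor logs.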

"MASK": set(),
"UNMASK": {"CRBNK.ACCOUNTS.a_key"},
},
id="cryptbank_bubbleprop_01",

These "bubbleprop" tests all serve a specific purpose for the warning tracking: each covers a different situation where an expression that depends on unmask calls is defined in one place, gets propagated further along in the plan, and is later used in a sensitive manner. Specifically, each test has a window function whose inputs are sensitive columns (sometimes transformed by more functions) and whose output is sometimes transformed further, and the window function's result is eventually used:

  1. By nothing sensitive (so no warnings related to the window function)
  2. As part of a filter condition
  3. As part of a join condition
  4. As part of an ordering key in a limit
  5. As part of an ordering key in the root
  6. As an aggregation key

Comment on lines +2 to +4
AGGREGATE(keys={'bucket': ROUND(cumavg / 10.0:numeric, 0:numeric) * 10:numeric}, aggregations={'n_rows': COUNT()})
JOIN(condition=t0.t_sourceaccount == UNMASK::(CASE WHEN [t1.a_key] = 0 THEN 0 ELSE (CASE WHEN [t1.a_key] > 0 THEN 1 ELSE -1 END) * CAST(SUBSTRING([t1.a_key], 1 + INSTR([t1.a_key], '-'), LENGTH([t1.a_key]) / 2) AS INTEGER) END), type=INNER, cardinality=SINGULAR_FILTER, reverse_cardinality=PLURAL_FILTER, columns={'cumavg': t0.cumavg})
PROJECT(columns={'cumavg': RELAVG(args=[SQRT(UNMASK::((1025.67 - ([t_amount]))))], partition=[a_branchkey], order=[(UNMASK::(DATETIME([t_ts], '+54321 seconds'))):asc_last], cumulative=True), 't_sourceaccount': t_sourceaccount})

@knassre-bodo knassre-bodo Oct 28, 2025


Here we can see a good instance of the propagation required to correctly identify some of the critical dependencies:

  • Unmask calls to t_amount and t_ts are used inside the RELAVG and SQRT calls used to derive cumavg
  • cumavg is bubbled upward through the join (which depends on unmasking a_key)
  • The bubbled value of cumavg is then divided by 10, rounded, multiplied by 10, and used as an aggregation key, so the ability to unmask t_amount and the ability to unmask t_ts are both critical dependencies.

Therefore, we will have warning logs about the safety of unmasking t_ts, t_amount, and a_key.

However, in bubbleprop_01, we only have warning logs for a_key. Even though 01 still has the same window function and join, the value of cumavg is never used in a sensitive manner, so if the user lacks the required permissions the output will contain garbage values without necessarily being malformed (we can change this definition if we need to).
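
The bubbling described in this comment can be sketched with a toy model. Everything here is an assumption made for illustration: the `Node` class, its fields, and `critical_unmask_columns` are hypothetical and do not mirror PyDough's relational node classes or the visitor added in this PR. The plan is modeled as a linear, bottom-up chain, with the join's second input reduced to the unmask call in its condition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy plan node (hypothetical, not a PyDough class)."""

    # Output column name -> input column names it is computed from.
    columns: dict[str, set[str]]
    # Output column name -> masked columns unmasked directly in its expression.
    unmasks: dict[str, set[str]] = field(default_factory=dict)
    # Input columns used in a sensitive position (filter/join condition,
    # ordering key, aggregation key).
    sensitive_inputs: set[str] = field(default_factory=set)
    # Masked columns unmasked directly inside a sensitive position.
    sensitive_unmasks: set[str] = field(default_factory=set)


def critical_unmask_columns(plan: list[Node]) -> set[str]:
    """Walk the plan bottom-up, bubbling each column's unmask dependencies."""
    critical: set[str] = set()
    deps: dict[str, set[str]] = {}  # unmask dependencies of the previous node's outputs
    for node in plan:
        critical |= node.sensitive_unmasks
        for name in node.sensitive_inputs:
            critical |= deps.get(name, set())
        new_deps: dict[str, set[str]] = {}
        for out, inputs in node.columns.items():
            collected = set(node.unmasks.get(out, set()))
            for name in inputs:
                collected |= deps.get(name, set())
            new_deps[out] = collected
        deps = new_deps
    return critical


# PROJECT: cumavg is derived from unmasked t_amount and t_ts.
project = Node(
    columns={"cumavg": set(), "t_sourceaccount": {"t_sourceaccount"}},
    unmasks={"cumavg": {"t_amount", "t_ts"}},
)
# JOIN: the condition unmasks a_key; cumavg passes through untouched.
join = Node(
    columns={"cumavg": {"cumavg"}},
    sensitive_inputs={"t_sourceaccount"},
    sensitive_unmasks={"a_key"},
)
# AGGREGATE: the key is derived from cumavg, which makes its dependencies critical.
aggregate = Node(
    columns={"bucket": {"cumavg"}, "n_rows": set()},
    sensitive_inputs={"cumavg"},
)

with_aggregation_key = critical_unmask_columns([project, join, aggregate])
without_sensitive_use = critical_unmask_columns([project, join])
```

Under this model, the chain ending in the aggregation flags t_amount, t_ts, and a_key, while the chain that stops at the join flags only a_key, which matches the difference between the two tests described above.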

@knassre-bodo knassre-bodo requested a review from a team October 28, 2025 19:52
@knassre-bodo knassre-bodo marked this pull request as ready for review October 28, 2025 19:52
@knassre-bodo knassre-bodo requested review from hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 28, 2025 19:52

@john-sanchez31 john-sanchez31 left a comment


LGTM! Just add the docstring for extract_masking_warning_logs

"""

self.critical_unmask_columns: set[str] = set()


Suggested change


def extract_masking_warning_logs(log_str: str) -> dict[str, set[str]]:
"""
TODO

Reminder


@juankx-bodo juankx-bodo left a comment


I will leave the final approval to @john-sanchez31 and @hadia206

column is critical to the output of the query.
"""

self.expression_visitor = MaskingCriticalDetectionExpressionVisitor()

Do we need a type hint for this?
