
Add RFC for access control row filtering and column masking #34

Open
wants to merge 1 commit into base: main
Conversation

BryanCutler

No description provided.

@prestodb-ci prestodb-ci added the from:IBM PRs from IBM label Jan 17, 2025
@prestodb-ci prestodb-ci requested review from a team, pratyakshsharma and anandamideShakyan and removed request for a team January 17, 2025 23:54
@BryanCutler (Author)

cc @tdcmeehan


Discussed at https://github.com/prestodb/presto/pull/21913#issuecomment-2050279419 is an approach that uses the existing SPI for connector
optimization to rewrite the plan during the optimization phase. The benefit of the proposed design over that approach is that it applies
globally to all connectors. Also, since it is an existing design that has already been in use
Contributor

Unfinished

Contributor

We should also probably mention benefits of conformity to the Trino SPI (ease of migration).


#### Additional information

1. What modules are involved
Contributor

I think you're missing the double whitespace at line endings, so this is not rendering correctly (screenshot attached).


## [Optional] Metrics

This is a 0 to 1 feature and will not have any metrics.
@aaneja (Contributor) Feb 4, 2025

It would be nice to have a RuntimeMetric, or a field in the query info, that tracks a count of how many row filters/column masks were applied to a TableScan.

Author

Yes, that sounds like it could be useful. But the access control plugin will know exactly what filters and masks are being applied, so couldn't that be done from within the plugin?

Contributor

So if we have both system and catalog level filters, we would be better off having a system-level metric that captures all applied filters/masks. Also, at the moment, we do not pass/create a ConnectorSession, which has the RuntimeStats to be used for adding metrics.

I'm okay with having metrics for this feature as a non-goal; these can be added later

@aaneja (Contributor) left a comment

LGTM % formatting nits

ViewExpression class to hold a filter/mask expression

```java
public ViewExpression(String identity, Optional<String> catalog, Optional<String> schema, String expression)
```
Contributor

Some questions about ViewExpression:

  1. Can you give some examples of what the expression field here could be?
  2. Do row filters and column masks have different types of expressions?
  3. Is there a particular set of variables which are allowed to be used?
  4. Will we verify that a ViewExpression is well-formed and can be applied to a table? e.g. what happens if the ViewExpression references a non-existent variable?
  5. For Iceberg tables which may have gone through schema evolution, will we be able to handle querying different table versions? What will happen if the column with a mask is removed in a newer version of the table, so that the older column mask/row filter references a column that no longer exists? Similarly, if the ViewExpression is updated but we query an older version of the table, what would happen?

Author

  1. Added some examples to the doc
  2. A row filter will create a FilterNode in the plan, so the expression should be like a predicate. For a mask, it will be like a function that takes the column value as input and outputs a new value (see the rough sketch below).
  3. I don't believe there are restrictions on the variables; it would be the same as the query being run.
  4. The filters and masks are verified during analysis, and an error is raised if they are invalid.
  5. I don't know everything that is involved in Iceberg schema evolution, but to retrieve a filter/mask, the qualified table name and column names are the parameters.
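
For illustration only, here is a rough sketch of what those expression strings could look like, using the ViewExpression constructor shown earlier. The identity, catalog/schema, column names, and SQL here are made-up assumptions, not part of the RFC:

```java
// Illustrative fragment; the expression is plain SQL evaluated against the queried table.

// Row filter: acts like a predicate and becomes a FilterNode in the plan.
ViewExpression rowFilter =
        new ViewExpression("admin", Optional.of("hive"), Optional.of("sales"), "region = 'US'");

// Column mask: acts like a scalar function of the masked column's value.
ViewExpression ssnMask =
        new ViewExpression("admin", Optional.of("hive"), Optional.of("sales"),
                "concat('XXX-XX-', substr(ssn, 8, 4))"); // keep only the last four digits
```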

Contributor

Thanks for the clarifications. As for the schema evolution bits: in Iceberg you could perform the following set of operations:

CREATE TABLE test(i int)
ALTER TABLE test RENAME COLUMN i TO j

So imagine you have a column filter, say i % 10, set on the test table after creation. Then someone runs the ALTER TABLE RENAME COLUMN statement to rename i to j.

If you try to query the table at this point, will the query just fail because there is no longer an i column?

Similarly, if you think in more malicious terms, you could do something like:

CREATE TABLE test(i int)
ALTER TABLE test RENAME COLUMN i TO j
ALTER TABLE test ADD COLUMN i

From my understanding, this would now allow the user to view data which would have previously been masked if the filters aren't updated properly.

In Iceberg, you can also query old versions of tables with the previous schema:

CREATE TABLE test(i int) -- say, version A
ALTER TABLE test RENAME COLUMN i TO j -- say, version B
-- Adjust filter to j % 10 now
SELECT * FROM "test@B"

Would this result in a failure to query older versions if the masks/row filters don't apply?

Author

I think most of this is beyond the scope of this SPI and more related to how a specific implementation of an access control policy might work. If a user has restricted access to a column, then any sane policy would not allow them to alter the table on it, and stop them right there.

These are good questions though and I don't know if all malicious cases can be covered. What we want to get right is that this SPI can provide the necessary info for a specific implementation to handle these cases. Having some example implementations in the code base could help make sure these types of actions would be covered.

Contributor

> If a user has restricted access to a column, then any sane policy would not allow them to alter the table on it, and stop them right there.

I agree that a sane policy would stop them, but what if it was done by accident? Or someone forgets to update the policy? Like you said - I guess it would technically be up to the plugin implementation to determine the correct filter/view expression. I am just trying to understand the responsibilities of Presto vs Plugin.

Also, some more iceberg specific comments:

For Iceberg tables, in order to determine which schema (and potentially which columns) to use, you would need to pass the entire table name to the function. However, the concern I have is that within the Iceberg plugin we have connector-specific logic for parsing the table name to determine the table version, which in turn helps us determine the schema and table version we query in the Iceberg connector. For example, an Iceberg table queried at time t would be referenced by "table@t". Is the string "table@t" what gets passed to the plugin? Or just "table"? This would also require the plugin to know about the Iceberg connector's implementation of how it parses the table type. We also have a relatively new secondary syntax for passing the table name which is actually part of the SQL spec: prestodb/presto#22851

This allows querying Iceberg tables via SYSTEM_TIME or SYSTEM_VERSION. I would consider adding the version/timestamp information in the query to the plugin SPI as well.

Author

> I agree that a sane policy would stop them, but what if it was done by accident? Or someone forgets to update the policy? Like you said - I guess it would technically be up to the plugin implementation to determine the correct filter/view expression. I am just trying to understand the responsibilities of Presto vs Plugin.

The responsibility of Presto is to supply the qualified object name, context, and other info to the access control plugin. It's not responsible for managing and updating the filters/masks on the query; that needs to be done by the plugin.

I checked an Iceberg example that uses some versioning and this is what is in the QualifiedObjectName when retrieving a filter or mask:

tableName = {QualifiedObjectName@25373} "iceberg.tpch.ctas_orders@2440845133131575443$changelog"
 catalogName = "iceberg"
 schemaName = "tpch"
 objectName = "ctas_orders@2440845133131575443$changelog"

It looks like all the information is there; the access control would need to interpret it correctly, though.
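
As a purely hypothetical illustration of what "interpreting it correctly" would mean for a name like ctas_orders@2440845133131575443$changelog, a plugin would have to do connector-specific parsing along these lines (none of these helpers are part of the proposed SPI):

```java
// Hypothetical parsing an access control plugin would have to do to recover the
// base table name from an Iceberg-style object name. The suffix conventions
// ("@<snapshot-id>", "$changelog", "$partitions", ...) are Iceberg-connector
// specifics, which is exactly why putting this logic in a plugin is awkward.
static String baseTableName(String objectName)
{
    String name = objectName;
    int at = name.indexOf('@');      // drop the "@<snapshot-id>" version suffix, if present
    if (at >= 0) {
        name = name.substring(0, at);
    }
    int dollar = name.indexOf('$');  // drop the "$<system-table>" suffix, if present
    if (dollar >= 0) {
        name = name.substring(0, dollar);
    }
    return name;                     // "ctas_orders@...$changelog" -> "ctas_orders"
}
```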

@ZacBlanco (Contributor) Feb 18, 2025

Yeah, so my concern is that we get objectName = "ctas_orders@2440845133131575443$changelog" passed to the plugin for getRowFilters and getColumnMasks, and the plugin would need to interpret this table name correctly. This string format is something specific to the Iceberg connector, so I am concerned that the plugin will need to have knowledge of the connector that the table is coming from. That would seem to break the abstraction of this interface. It is similar with Hive if you were to access the $partitions table.

So for this specific case I think there are two things that the interface needs:

  1. system_time/system_version are SQL standards that we support in Presto. I think that should be one of the arguments in addition to the qualified object name. Or we add that information to QualifiedObjectName.
  2. I don't think we have this abstraction currently, but there needs to be some notion of "table type" that masking needs to know about. The plugin needs to correctly pick out that we're reading, for example, the $partitions, $files, or $changelog system tables in order to get the right filter/column mask. Having the plugin be responsible for parsing a connector-specific name seems wrong IMO.

(another commenter)

> so I am concerned that the plugin will need to have knowledge of the connector that the table is coming from.

You can implement ConnectorAccessControl inside the connector, so an access control plugin inside the connector can handle table schema and policy evolution.
Look at the Ranger example inside the Hive connector module.

> The plugin needs to correctly pick out that we're reading, for example, the $partitions, $files, or $changelog system tables in order to get the right filter/column mask.

Will each table type have a different schema and policy? Can we treat table types as different tables?

> Having the plugin be responsible for parsing a connector-specific name seems wrong IMO.

Good observation. The same problem exists for other table-level methods like AccessControl#checkCanSelectFromColumns.

*
* @return a mapping from columns to masks, or an empty map if not applicable. The keys of the return Map are a subset of {@code columns}.
*/
default Map<ColumnMetadata, ViewExpression> getColumnMasks(Identity identity, AccessControlContext context, CatalogSchemaTableName tableName, List<ColumnMetadata> columns)
Contributor

I see a lot of methods for retrieving the filters and masks, but will we support any API for setting them through Presto? Or will it solely be done through some kind of configuration/external service?

Author

This is just for the SPI changes initially. If approved, it would be easy to port over some existing implementations that are compatible, like Ranger https://github.com/trinodb/trino/blob/9499dc82f2d23314dbc76b0443bedd121e6400eb/plugin/trino-ranger/src/main/java/io/trino/plugin/ranger/RangerSystemAccessControl.java#L822
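
To give a rough idea of the shape such an implementation could take, here is a minimal hypothetical sketch around the getColumnMasks signature proposed above. The class name, hard-coded policy, import paths for the proposed ViewExpression class, and use of Guava are illustrative assumptions only:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

import com.facebook.presto.spi.CatalogSchemaTableName;
import com.facebook.presto.spi.ColumnMetadata;
import com.facebook.presto.spi.security.AccessControlContext;
import com.facebook.presto.spi.security.Identity;
import com.facebook.presto.spi.security.ViewExpression; // proposed class; package assumed
import com.google.common.collect.ImmutableMap;

// Hypothetical sketch only. In a real plugin this method would live in a
// SystemAccessControl (or ConnectorAccessControl) implementation such as the
// Ranger one linked above, which would look the policy up from an external service.
public class ExampleMaskingPolicy
{
    public Map<ColumnMetadata, ViewExpression> getColumnMasks(
            Identity identity,
            AccessControlContext context,
            CatalogSchemaTableName tableName,
            List<ColumnMetadata> columns)
    {
        ImmutableMap.Builder<ColumnMetadata, ViewExpression> masks = ImmutableMap.builder();
        for (ColumnMetadata column : columns) {
            // Illustrative hard-coded policy: mask "ssn" for everyone except "admin"
            if (column.getName().equals("ssn") && !identity.getUser().equals("admin")) {
                masks.put(column, new ViewExpression(
                        identity.getUser(),
                        Optional.of(tableName.getCatalogName()),
                        Optional.of(tableName.getSchemaTableName().getSchemaName()),
                        "concat('XXX-XX-', substr(ssn, 8, 4))"));
            }
        }
        return masks.build();
    }

    // getRowFilters(...) would be analogous, returning predicate expressions
    // (e.g. "region = 'US'") that Presto turns into FilterNodes over the table scan.
}
```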

* Add SPIs to AccessControl classes to allow retrieval of row filters and column masks
* Add functionality for Presto to apply filters and masks to a query

### [Optional] Non-goals
Contributor

If there are no non-goals I would just remove this header

@BryanCutler (Author)

Thanks @aaneja, @ZacBlanco, and @tdcmeehan for the feedback; I've updated with the suggestions.
