[RFC] [Resource Access Control] Finalizing the code design #5062

Open
DarshitChanpura opened this issue Jan 27, 2025 · 12 comments
Labels
resource-permissions Label to track all items related to resource permissions triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

@DarshitChanpura
Member

DarshitChanpura commented Jan 27, 2025

Background

With respect to Resource Access Control, the Document-Level Security (DLS) approach was updated in PR #5016. Recently, the plan shifted from implementing abstract APIs in OpenSearch core to modifying the Security plugin so that it automatically applies resource access control to the relevant indices. However, this method also has drawbacks, notably the risk of thread exhaustion. To guide the final decision, three primary approaches have been considered. Each is described below with its Advantages and Limitations.


1. Terms Lookup Query

Description
Use a Terms Lookup Query (TLQ) to dynamically fetch resource-sharing information from a separate index, then match the requested resource IDs against those entries.

Advantages

  • Native Query: Uses standard OpenSearch query features, so minimal custom logic is required.
  • Simple Integration: If resource-sharing data is already in a single document, TLQ is straightforward to set up.
  • Built-in Caching: TLQ benefits from the caching and query optimizations provided by OpenSearch.

Limitations

  • Single Document Constraint: TLQ requires all resource IDs for a user (or resource) to be in a single document, which is rarely feasible for real-world data scattered across multiple docs.
  • Scalability Issues: Merging large sets of resource IDs into one document can become unwieldy, leading to performance or storage problems.
  • Narrow Applicability: If the resource-sharing model is more complex, TLQ quickly becomes impractical as a generic solution.
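To make the single-document constraint concrete, here is a minimal in-memory sketch of what a terms lookup does conceptually: it fetches one lookup document by ID, reads the array at a field path, and keeps only the candidate resource IDs contained in that array. The `TermsLookupSketch` class and the `resource_ids` field are hypothetical; a real TLQ is expressed in the OpenSearch query DSL, not with this code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a terms lookup against ONE lookup document. */
final class TermsLookupSketch {
    // Models the lookup index: document ID -> (field path -> array of values).
    private final Map<String, Map<String, List<String>>> lookupIndex;

    TermsLookupSketch(Map<String, Map<String, List<String>>> lookupIndex) {
        this.lookupIndex = lookupIndex;
    }

    /** Keeps the candidate IDs that appear in the array at `path` of document `lookupDocId`. */
    Set<String> filter(Set<String> candidateIds, String lookupDocId, String path) {
        Map<String, List<String>> doc = lookupIndex.get(lookupDocId);
        if (doc == null || !doc.containsKey(path)) {
            return Set.of(); // missing lookup document or field: nothing matches
        }
        Set<String> allowed = new HashSet<>(doc.get(path));
        return candidateIds.stream().filter(allowed::contains).collect(Collectors.toSet());
    }
}
```

Every ID a user may see has to sit in a single array of a single document; if sharing info is stored as one document per resource, one lookup cannot gather it.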

2. In-Memory Map

Description
Load resource-sharing configuration into an in-memory map—similar to how the Security plugin loads its main security configuration. This map would be updated in near real-time whenever resource-sharing information changes (e.g., new resources or updated permissions).

Advantages

  • Fast Lookups: In-memory data structures can offer very quick read performance.
  • Direct Integration: Follows the same pattern as existing Security config, which is already well understood.
  • Low Runtime Query Overhead: No need to perform frequent index lookups if all sharing data is already in memory.

Limitations

  • Frequent Updates: Resource-sharing data can change often (user grants/revokes), leading to continuous map updates.
  • Scalability & Distribution: Synchronizing frequent changes across a cluster can become a bottleneck and, if updates spike, risk a denial of service (DoS).
  • Operational Complexity: Requires robust mechanisms to keep the in-memory map consistent across all nodes.
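A minimal sketch of the in-memory map option, assuming each node holds an immutable snapshot of resource ID → principals and atomically swaps in a newer snapshot whenever the cluster broadcasts a change (in the spirit of how security config is reloaded). `SharingCache` and its version rule are hypothetical; the point is that every grant or revoke produces a new snapshot on every node.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical per-node cache of sharing info, replaced wholesale on each update. */
final class SharingCache {
    /** Immutable snapshot: resource ID -> principals allowed to see it, plus a version. */
    record Snapshot(long version, Map<String, Set<String>> grants) {
        Snapshot {
            Map<String, Set<String>> copy = new HashMap<>();
            grants.forEach((k, v) -> copy.put(k, Set.copyOf(v)));
            grants = Map.copyOf(copy); // deep, immutable copy
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(0L, Map.of()));

    /** Read path: one map lookup, no index query. */
    boolean canAccess(String resourceId, String principal) {
        Set<String> principals = current.get().grants().get(resourceId);
        return principals != null && principals.contains(principal);
    }

    /** Installs `next` only if it is newer, so late deliveries cannot roll the cache back. */
    boolean replaceIfNewer(Snapshot next) {
        while (true) {
            Snapshot old = current.get();
            if (next.version() <= old.version()) {
                return false;
            }
            if (current.compareAndSet(old, next)) {
                return true;
            }
        }
    }
}
```

Reads are cheap, but the cost of each write (copy plus cluster-wide distribution) grows with the total amount of sharing data, which is the scalability concern listed above.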

3. Plugins Make API Calls

Description
Expose new APIs that other plugins can call whenever they need to check if a user has access to a given resource. These APIs handle the logic for determining resource access, potentially using the DLS approach behind the scenes or another method.

Advantages

  • Flexibility: Can be applied to resources stored in an index or elsewhere (e.g., external systems).
  • Centralized Logic: Minimizes the risk of misconfiguration by consolidating access checks in one place.
  • Extensibility: Provides a uniform interface, making it easier to evolve or integrate new resource types in the future.

Limitations

  • Implementation Complexity: Requires designing and maintaining well-defined, backward-compatible APIs.
  • Human Error: Plugin developers must remember to call these APIs correctly and consistently.
  • Performance Overheads: Multiple API calls could introduce latency, especially under high load.
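A minimal sketch of the API-call option from a plugin's point of view. `ResourceAccessEvaluator`, `StaticEvaluator`, `GetResourceHandler` and the `.plugin-resources` index are hypothetical names, not an existing Security plugin API. The plugin asks before acting; the "Human Error" concern is that nothing forces this call to be made.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical contract the Security plugin could expose to other plugins. */
interface ResourceAccessEvaluator {
    boolean hasPermission(String resourceId, String resourceIndex, String user);
}

/** Hypothetical toy implementation backed by a fixed table: index -> resource ID -> users. */
final class StaticEvaluator implements ResourceAccessEvaluator {
    private final Map<String, Map<String, Set<String>>> table;

    StaticEvaluator(Map<String, Map<String, Set<String>>> table) { this.table = table; }

    @Override
    public boolean hasPermission(String resourceId, String resourceIndex, String user) {
        return table.getOrDefault(resourceIndex, Map.of())
                .getOrDefault(resourceId, Set.of())
                .contains(user);
    }
}

/** Hypothetical plugin handler: the explicit check the plugin developer must remember. */
final class GetResourceHandler {
    private final ResourceAccessEvaluator evaluator;

    GetResourceHandler(ResourceAccessEvaluator evaluator) { this.evaluator = evaluator; }

    String handle(String resourceId, String user) {
        if (!evaluator.hasPermission(resourceId, ".plugin-resources", user)) {
            // Same answer as for a missing resource, so existence is not leaked.
            throw new SecurityException("no such resource: " + resourceId);
        }
        return "resource " + resourceId;
    }
}
```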

Conclusion

  1. Terms Lookup Query is too restrictive due to the single-document requirement.
  2. In-Memory Map could create scalability issues with frequent updates.
  3. APIs for Resource Verification are more flexible and extensible, albeit with higher implementation complexity and reliance on proper usage by plugin developers.

Feedback from plugin developers and end users will help guide the final choice. Each approach has trade-offs in terms of performance, maintainability, and extensibility.

@DarshitChanpura DarshitChanpura added the resource-permissions Label to track all items related to resource permissions label Jan 27, 2025
@github-actions github-actions bot added the untriaged Require the attention of the repository maintainers and may need to be prioritized label Jan 27, 2025
@DarshitChanpura
Member Author

@cwperks @nibix @reta Let's share our thoughts and finalize the approach here.

@DarshitChanpura DarshitChanpura changed the title [Resource Access Control] Finalizing the code design [RFC] [Resource Access Control] Finalizing the code design Jan 27, 2025
@cwperks cwperks added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jan 27, 2025
@cwperks
Member

cwperks commented Jan 27, 2025

@DarshitChanpura I think we should expand on each option in a decision doc with more detail and capture on GitHub some of what has been discussed in person.

As I see it, we have discussed two main approaches, but a third one is now coming into view that takes into account https://github.com/opensearch-project/opensearch-remote-metadata-sdk/, with which resource metadata could be stored outside of OpenSearch.

In the first 2 designs, we were operating under the assumption that resource metadata is stored in a system index. In this design, from a plugin developer's point of view they only need to tell the security plugin what index the sharable resources are stored in and security would handle everything else. Plugins would be unaware of whether the cluster was running with security or not as the same code would be written for both cases.

Those 2 options are:

  1. Store the resource_user and shared_with info with the resource metadata (similar to how it's done today, but standardized, with security being the one to write and control this info)
  2. Centrally storing sharing information in an index for all resource types

Definitions:

  1. resource_user - The creator of the resource
  2. shared_with - A data structure that contains sharing info (and will be designed to support resource authorization as well, which allows the sharer to specify the level of access when sharing)

With shared_with, there are a few different ways that resources can be shared. @DarshitChanpura has been referring to these as Recipient Types:

  1. users - Direct share by username
  2. role - Sharing based on the mapped roles (Roles contained in the security index)
  3. backend_role - This is pertinent to SSO users and these are roles from the backend identity provider.

Each shared_with would also be associated with an action group to specify the level of access that the target group of recipients has to the sharable resource.
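One possible shape for `shared_with`, assuming it maps an action group name to the recipients of each Recipient Type. The type names and the `read_only` action group used in usage are hypothetical; the thread does not fix a schema.

```java
import java.util.Map;
import java.util.Set;

/** The three Recipient Types discussed above. */
enum RecipientType { USERS, ROLES, BACKEND_ROLES }

/** Hypothetical: recipients for one action group, keyed by recipient type. */
record Recipients(Map<RecipientType, Set<String>> byType) {
    Recipients {
        byType = Map.copyOf(byType);
    }

    Set<String> recipientsOf(RecipientType type) {
        return byType.getOrDefault(type, Set.of());
    }
}

/** Hypothetical sharing record for one resource: creator plus action group -> recipients. */
record ResourceSharing(String resourceId, String resourceUser, Map<String, Recipients> sharedWith) {
    ResourceSharing {
        sharedWith = Map.copyOf(sharedWith);
    }
}
```

Keying by action group keeps "who" and "at what access level" together, so adding resource authorization later would not change the top-level shape.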

Conditions for sharing

With resource sharing, there are 2 conditions in which a resource is visible to the authenticated user.

  1. The authenticated user is the resource owner
  2. The resource has been shared with the authenticated user (either via username, role or backend_role) at any access level
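The two conditions reduce to a single predicate. This sketch assumes the authenticated user is described by a username plus mapped roles and backend roles, and flattens `shared_with` to one set per recipient type (any access level grants visibility). `AuthUser`, `Sharing` and `Visibility` are hypothetical names.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical: authenticated user as seen by the Security plugin (simplified). */
record AuthUser(String name, Set<String> roles, Set<String> backendRoles) {}

/** Hypothetical: sharing info for one resource, flattened across access levels. */
record Sharing(String resourceUser, Set<String> users, Set<String> roles, Set<String> backendRoles) {}

final class Visibility {
    private Visibility() {}

    /** True if the user owns the resource or it is shared with them by any recipient type. */
    static boolean isVisible(Sharing s, AuthUser u) {
        if (s.resourceUser().equals(u.name())) {
            return true; // condition 1: owner
        }
        return s.users().contains(u.name())                              // condition 2: username
                || !Collections.disjoint(s.roles(), u.roles())               // ... or mapped role
                || !Collections.disjoint(s.backendRoles(), u.backendRoles()); // ... or backend role
    }
}
```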

1. Store resource owner and sharing info w/ the resource metadata

In this approach, there must be a way for security to write the resource_user and shared_with info to the resource metadata document. Ideally, this data is protected such that only the security plugin can make updates to these fields.

  1. Search Request - When a plugin makes a search request, security would perform DLS behind the scenes where it would add a term-level query to only return documents that either 1) the authenticated user is the owner of or 2) the resource has been shared with the authenticated user
  2. Get Request - For Get Request, we would need to ensure that the authenticated user could only get documents that meet conditions 1 and 2 described above
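For the search path, the filter security injects could be a `bool` query with `should` clauses and `minimum_should_match: 1`, as built below. This sketch emits OpenSearch query DSL as a JSON string; the field names `resource_user` and `shared_with.*` are assumptions about the metadata layout, and real code would use query builders rather than string concatenation.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical builder for the owner-or-shared DLS filter (JSON query DSL). */
final class OwnerOrSharedFilter {
    private OwnerOrSharedFilter() {}

    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    private static String q(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    private static String terms(String field, List<String> values) {
        String arr = values.stream().map(OwnerOrSharedFilter::q).collect(Collectors.joining(","));
        return "{\"terms\":{" + q(field) + ":[" + arr + "]}}";
    }

    /** Owner OR shared by username OR by role OR by backend role. */
    static String build(String user, List<String> roles, List<String> backendRoles) {
        String should = String.join(",",
                "{\"term\":{\"resource_user\":" + q(user) + "}}",
                terms("shared_with.users", List.of(user)),
                terms("shared_with.roles", roles),
                terms("shared_with.backend_roles", backendRoles));
        return "{\"bool\":{\"should\":[" + should + "],\"minimum_should_match\":1}}";
    }
}
```

The same condition would also have to be evaluated for Get Requests, since those bypass the search query entirely.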

2. Centrally storing sharing information in an index for all resource types (preferred)

Similar to 1, but centrally stored in a single index for all resource sharing info across the cluster. In this model, it can be assumed that this information is safe from being overridden because security owns the index, but it does introduce complications when plugins perform Search Requests and Get Requests on their resource indices.

  1. Search Request - When a plugin makes a search request, security first needs to obtain the docIDs of the resources visible to the authenticated user. After the docIDs are collected, security will ensure that the search request can only be performed on those docIDs
  2. Get Request - Similar to above, but if the docID is not contained in the list then security can fail the Get Request since the resource either doesn't exist or is not visible to the authenticated user
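A sketch of the two-step flow for the central index, assuming a hypothetical sharing index modelled as an in-memory list of entries keyed by resource index and document ID. Step one collects the visible docIDs; step two evaluates a search only over those IDs, or returns "not found" for a get whose ID is not among them. All type names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical entry in the central sharing index. */
record SharingEntry(String resourceIndex, String docId, String owner, Set<String> sharedWithUsers) {
    boolean visibleTo(String user) {
        return owner.equals(user) || sharedWithUsers.contains(user);
    }
}

final class CentralSharingIndex {
    private final List<SharingEntry> entries;

    CentralSharingIndex(List<SharingEntry> entries) { this.entries = List.copyOf(entries); }

    /** Step 1: docIDs in `resourceIndex` visible to `user`. */
    Set<String> visibleDocIds(String resourceIndex, String user) {
        return entries.stream()
                .filter(e -> e.resourceIndex().equals(resourceIndex) && e.visibleTo(user))
                .map(SharingEntry::docId)
                .collect(Collectors.toSet());
    }

    /** Step 2a (search): run the plugin's query only over the visible docIDs. */
    <T> List<T> search(Map<String, T> index, Predicate<T> query, String resourceIndex, String user) {
        return visibleDocIds(resourceIndex, user).stream()
                .map(index::get)
                .filter(doc -> doc != null && query.test(doc))
                .collect(Collectors.toList());
    }

    /** Step 2b (get): a hidden document looks exactly like a missing one. */
    <T> Optional<T> get(Map<String, T> index, String resourceIndex, String docId, String user) {
        if (!visibleDocIds(resourceIndex, user).contains(docId)) {
            return Optional.empty();
        }
        return Optional.ofNullable(index.get(docId));
    }
}
```

Restricting the search up front, rather than dropping hits afterwards, keeps hit counts and pagination consistent with what the user may see.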

Considerations for resource metadata stored outside of OpenSearch

With https://github.com/opensearch-project/opensearch-remote-metadata-sdk/, there is an effort to abstract away metadata storage for plugins and allow metadata to be stored outside of OpenSearch. With that in mind, I think the design for resource sharing should account for this, so that it works regardless of whether 1) resource metadata is stored in OpenSearch or 2) resource metadata is stored outside of OpenSearch.

I like the idea of extending the concept of DLS to remote stores, but I'm not sure how best to design that. One option I was considering is to give plugin developers a mechanism for obtaining the IDs of resources visible to the authenticated user and leave it up to the plugin developer to use them appropriately.

For example, a plugin could make a call similar to:

```java
// This is pseudo-code
ResourceSharingService<SampleResource> sharingService; // assigned only if the Security plugin is installed

Set<String> visibleResourceIds = null;

if (sharingService != null) {
    // Should this support pagination?
    visibleResourceIds = sharingService.getResourceIdsForCurrentUser();
}

SearchRequest searchReq = new SearchRequest(resourceIndex);
if (visibleResourceIds != null) {
    // The plugin developer is responsible for adding the filter here, for example
    // by restricting the search to the visible document IDs:
    searchReq.source(new SearchSourceBuilder()
        .query(QueryBuilders.idsQuery().addIds(visibleResourceIds.toArray(new String[0]))));
}
```

If the plugin uses a remote store for resource metadata then they can figure out how to use the resource ids appropriately.

The Security plugin will also need hooks for when sharable resources are created or deleted.
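These hooks could take the shape of a listener that the resource plugin notifies after a create or delete succeeds. `ResourceLifecycleListener` and `OwnerRegistry` are hypothetical; they only illustrate that creation has to record the owner and deletion has to remove the sharing entry, so that no orphaned grants remain.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical callbacks a resource plugin would invoke on the Security plugin. */
interface ResourceLifecycleListener {
    void onResourceCreated(String resourceIndex, String resourceId, String creator);
    void onResourceDeleted(String resourceIndex, String resourceId);
}

/** Hypothetical Security-side implementation tracking the owner of each resource. */
final class OwnerRegistry implements ResourceLifecycleListener {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    private static String key(String index, String id) { return index + "/" + id; }

    @Override
    public void onResourceCreated(String resourceIndex, String resourceId, String creator) {
        // putIfAbsent: a repeated "create" for the same ID must not change the owner.
        owners.putIfAbsent(key(resourceIndex, resourceId), creator);
    }

    @Override
    public void onResourceDeleted(String resourceIndex, String resourceId) {
        owners.remove(key(resourceIndex, resourceId)); // sharing info goes with the resource
    }

    Optional<String> ownerOf(String resourceIndex, String resourceId) {
        return Optional.ofNullable(owners.get(key(resourceIndex, resourceId)));
    }
}
```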

@nibix
Collaborator

nibix commented Feb 3, 2025

A couple of additions:

Avoiding "human error"

The issue lists the following limitation for the last option, "3. Plugins Make API Calls":

Human Error: Plugin developers must remember to call these APIs correctly and consistently.

Even though the DLS approaches reduce the risk of security issues caused by misusing the provided concepts, they do not eliminate it completely. The DLS approaches always assume that a resource corresponds to exactly one document in an index. There are many conceivable cases where this is not true; for example, the alerting plugin has a concept called "alert comments" for which the plugin implementation needs to perform additional checks to ensure authorized access: https://github.com/opensearch-project/alerting/blob/main/alerting/src/main/kotlin/org/opensearch/alerting/transport/TransportIndexAlertingCommentAction.kt ... Thus, to a certain extent, this risk applies to all approaches.
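To illustrate why derived resources still need plugin-side checks, here is a sketch loosely modelled on the alert-comments case: a comment has no sharing entry of its own, so access is decided by looking up its parent alert. The types and the "only the author may edit" rule are hypothetical simplifications, not the alerting plugin's actual logic.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical parent resource with its own sharing information. */
record Alert(String id, Set<String> visibleTo) {}

/** Hypothetical derived resource: lives in another document, points at its parent. */
record Comment(String id, String alertId, String author) {}

final class CommentAccess {
    private final Map<String, Alert> alerts; // alert ID -> alert

    CommentAccess(Map<String, Alert> alerts) { this.alerts = alerts; }

    /** A comment is readable only if its parent alert is visible to the user. */
    boolean canRead(Comment c, String user) {
        Alert parent = alerts.get(c.alertId());
        return parent != null && parent.visibleTo().contains(user);
    }

    /** Editing additionally requires authorship (an assumed rule for illustration). */
    boolean canEdit(Comment c, String user) {
        return canRead(c, user) && c.author().equals(user);
    }
}
```

A per-document DLS filter on the comments index alone would not know about the parent lookup, so this logic stays in the plugin whichever approach is chosen.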

DLS

It is already mentioned in the issue, but I'd just like to put emphasis on this: DLS has quite a few limitations which make DLS based approaches challenging and expensive to achieve.

  • Enforcing DLS rules that depend on cross-index information is not possible with classic, Lucene-based DLS. It can be achieved with a terms lookup query and filter-level DLS. However, filter-level DLS is also subject to a couple of limitations.
  • DLS only applies to read operations. A resource access control mechanism, however, also needs to control write operations (so that users cannot manipulate or overwrite resources they do not own). It might be possible to achieve a limited form of DLS for write operations, but this needs additional research, design and problem solving: for each write operation, we would need to work out how access controls can be enforced.
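The write-path gap can be shown with a small sketch: a read filter hides documents, but an index or delete request addressed by ID needs its own check. `WriteGuard` and its owner map are hypothetical; in practice such a check would have to run before each write action executes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy document store where writes are guarded by ownership. */
final class WriteGuard {
    private final Map<String, String> docs = new HashMap<>();   // id -> content
    private final Map<String, String> owners = new HashMap<>(); // id -> owner

    /** Creates or overwrites a document; overwriting requires ownership. */
    boolean index(String id, String content, String user) {
        String owner = owners.get(id);
        if (owner != null && !owner.equals(user)) {
            return false; // a read-only DLS filter would not block this overwrite
        }
        owners.putIfAbsent(id, user);
        docs.put(id, content);
        return true;
    }

    /** Deletes a document only if the caller owns it. */
    boolean delete(String id, String user) {
        if (!user.equals(owners.get(id))) {
            return false;
        }
        owners.remove(id);
        docs.remove(id);
        return true;
    }

    String read(String id) { return docs.get(id); }
}
```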

@DarshitChanpura
Copy link
Member Author

Given the scalability issues with TLQ and the synchronization issues with an in-memory map in large clusters, API calls seem to be the "proper" solution. Its only downside is that plugin developers must remember to call these APIs; in exchange, this approach gives us the flexibility to handle internal and external data stores as well as to overcome DLS shortcomings. I've already introduced the appropriate APIs in #5016. If the implementation looks okay, we can finalize the approach.
Thoughts @reta @cwperks @nibix ?

@reta
Collaborator

reta commented Feb 18, 2025

Thoughts @reta @cwperks @nibix ?

@DarshitChanpura @nibix @cwperks no intent to hold you folks back, but I believe that, instead of inventing yet another model, refining, improving or extending the existing security models (DLS, etc.) would not only benefit existing applications but also significantly reduce the cost of maintaining both.

@DarshitChanpura
Member Author

@reta From what I understand, DLS is the only existing model that fits this use case, and it was in fact modified, but as @nibix mentioned, it has some shortcomings when it comes to realizing the full potential of the resource sharing model. Hence, the API-based approach is suggested as the path forward.

@nibix
Collaborator

nibix commented Feb 19, 2025

To be honest, it seems a bit like this conversation is kind of stuck. There are good arguments for each side, but it seems to be unclear how to come to a synthesis.

In order to move this forward a bit, I will try to create a more transparent (and possibly objective?) comparison of the possible approaches below. I will certainly get things wrong, so please correct me! :-)

Comparison

Approaches

The discussed approaches. Each approach corresponds to one column in the matrix below.

  1. API based
  2. DLS based, resource information inside additional index
  3. DLS based, resource information inside resource documents

Matrix

Rows describe whether an approach has a certain property or not. See the section below for a longer explanation.

1. API 2. DLS ext 3. DLS int
Covers complicated resource models
No explicit checks in plugin necessary
Suitable for big data resources ?
Side-effect free
Provides benefits beyond the feature
Straight-forward to implement
Can be implemented within 4 weeks

Details

Covers complicated resource models

Can the approach cover resource models where there is no 1:1 relationship between a resource and a document? Examples: derived resources, runtime resources, hierarchical models (a folder with items, each of which is a resource).

No explicit checks in plugin necessary

Does the approach provide a solution that a plugin can very easily use or is greater integration of the plugin code necessary?

Suitable for big data resources

Is the approach suitable for millions of resources? In other words, can OpenSearch's big-data capabilities be used to manage this data? The DLS ext approach requires keeping the resource sharing information in the heap, so it is not suitable for big-data scenarios.

Side-effect free

Does the approach avoid side channels that might expose unnecessary information? The DLS int approach requires updating the resource documents whenever the sharing config changes. This means that the document's version and seq_no properties are updated, which affects concurrent operations that rely on this information. It also reveals to every user with access to a document that its resource sharing configuration was modified.
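The seq_no side effect can be illustrated with a toy model of optimistic concurrency control, where a conditional write (similar in spirit to OpenSearch's `if_seq_no` request parameter) is rejected if the document changed in between. `VersionedDoc` is a hypothetical simplification; it shows a sharing-only update made by security causing a plugin's unrelated, concurrent update to fail.

```java
/** Hypothetical toy document whose sequence number is bumped on every write. */
final class VersionedDoc {
    private long seqNo = 0;
    private String content;
    private String sharedWith;

    VersionedDoc(String content, String sharedWith) {
        this.content = content;
        this.sharedWith = sharedWith;
    }

    synchronized long seqNo() { return seqNo; }

    /** Unconditional sharing update, as the Security plugin would perform it. */
    synchronized void updateSharing(String newSharedWith) {
        sharedWith = newSharedWith;
        seqNo++;
    }

    /** Conditional content update: succeeds only if nobody wrote since `expectedSeqNo`. */
    synchronized boolean updateContentIf(long expectedSeqNo, String newContent) {
        if (seqNo != expectedSeqNo) {
            return false; // version conflict
        }
        content = newContent;
        seqNo++;
        return true;
    }

    synchronized String content() { return content; }

    synchronized String sharedWith() { return sharedWith; }
}
```

With sharing info stored elsewhere (API or DLS ext), sharing changes would not touch the resource document's seq_no.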

Straight-forward to implement

Is it already quite clear how this can be implemented? Or, is experimentation necessary to achieve the implementation?

Provides benefits beyond the feature

Can the implementation also be used to enhance existing features, e.g. DLS?

Can be implemented within 4 weeks

Disclaimer: Obviously, this is a very subjective assessment.

@nibix
Collaborator

nibix commented Feb 19, 2025

Just my opinion (in no way authoritative):

I do believe that DLS could use some fundamental improvements. Especially the limitation that it only works for read operations, but does not prevent deleting or overwriting unauthorized documents is kind of odd and surprising.

However, a property of good project management is to keep goals at a manageable size. A goal should be a single goal and not actually a combination of several goals.

Thus, one could start with the goal of improving DLS. When this is achieved, one could revisit the resource sharing approach (which will have limitations as explained above).

If, however, the goal to achieve a resource sharing framework cannot be delayed, it's clear that DLS should not be utilized for that.

@nibix
Collaborator

nibix commented Feb 19, 2025

Another note:

The argument that the approach of requiring the plugin to call a dedicated API was so far quite abstract. This makes it difficult for me to judge how big the issue actually is. To judge this better, I think a prototype adaptation of a real-world plugin (for example, alerting) would be very helpful.

@DarshitChanpura
Member Author

@nibix Thank you for the detailed input. I already have an anomaly-detection plugin integration in progress to test this new change: opensearch-project/anomaly-detection#1400. It is still in draft state, as the core feature is stalled. Now that we have consensus, I can make progress on the feature PR and subsequently on the AD plugin PR.

Improving DLS can be done as a separate goal. I will open another issue for that improvement. For the resource sharing feature, API based model can be introduced. Once DLS enhancement is in place in future, we can provide plugin developers with an option to rely on DLS or continue with explicit calls depending on their use-case.

@cwperks
Member

cwperks commented Feb 19, 2025

The argument that the approach of requiring the plugin to call a dedicated API was so far quite abstract.

Let's be specific when we bring up what this means. When I read API, my mind jumps to REST API and I don't think that's what @DarshitChanpura had in mind.

Essentially, plugin devs already add a clause of their own on a SearchRequest (Example in AD) when searching through a system index containing resource metadata.

I think @DarshitChanpura has been thinking of "replicating" this logic, in the sense that plugin devs need to make a call (List<String> resourceIds = sharingService.getVisibleResourceIds()) to get a clause to use when searching for resource metadata. The clause may be something like AND resource_id IN (id1, id2, ...) (thinking in SQL, but trying to convey the idea).

Keep in mind that resource metadata could be stored outside of OpenSearch so I'm wondering if this could be truly agnostic with a simple assumption that any sharable resource has a unique identifier.
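If the only assumption is that every sharable resource has a unique identifier, the plugin-side step can be written without knowing where the metadata lives. In this sketch, `MetadataStore` is a hypothetical abstraction (not the remote-metadata SDK's API), and the visible ID set plays the role of the `resource_id IN (...)` clause.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical storage abstraction: an OpenSearch index, a remote store, etc. */
interface MetadataStore<T> {
    /** Returns the resources with the given IDs that exist in the store. */
    List<T> fetchByIds(Set<String> ids);
}

/** Hypothetical in-memory store used only for illustration. */
final class InMemoryStore<T> implements MetadataStore<T> {
    private final Map<String, T> data = new HashMap<>();

    void put(String id, T value) { data.put(id, value); }

    @Override
    public List<T> fetchByIds(Set<String> ids) {
        List<T> out = new ArrayList<>();
        for (String id : ids) {
            T v = data.get(id);
            if (v != null) {
                out.add(v);
            }
        }
        return out;
    }
}

final class VisibleResources {
    private VisibleResources() {}

    /** Equivalent of "WHERE resource_id IN (visibleIds)", whatever the backing store. */
    static <T> List<T> list(MetadataStore<T> store, Set<String> visibleIds) {
        if (visibleIds.isEmpty()) {
            return List.of(); // nothing shared: skip the store round-trip
        }
        return store.fetchByIds(visibleIds);
    }
}
```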

@DarshitChanpura
Member Author

Let's be specific when we bring up what this means. When I read API, my mind jumps to REST API and I don't think that's what @DarshitChanpura had in mind.

I'm thinking about REST APIs.


4 participants