Conversation


@rodmgwgu rodmgwgu commented Nov 11, 2025

Related issue: #133

This PR intends to solve the issue of having to hard-load all Casbin permissions on every request by implementing a simple per-process caching strategy with global invalidation support.

This means that even when we have multiple copies of the Casbin policies loaded in multiple processes (uwsgi workers on a single lms or cms container instance, the lms and cms instances themselves, or multiples of these on a Kubernetes cluster), we can always be sure the latest policies are loaded in memory: they are reloaded only when needed, based on an invalidation timestamp checked in a cross-process cache.

Note: This PR also disables the auto-reload functionality by default (by setting CASBIN_AUTO_LOAD_POLICY_INTERVAL to 0 in the defaults), as it should no longer be needed.

Approach: Handle cache invalidation via Django cache with a timestamp

It works like this (thinking of how it would work on a standard tutor prod setup):

  • Each UWSGI worker process has its own instance of the AuthzEnforcer
  • AuthzEnforcer keeps a timestamp of the last time it loaded the policy
  • On each request to enforce, AuthzEnforcer checks the last invalidation timestamp in the Django cache (usually backed by redis) and compares it to its internal one. If the cached timestamp is newer, it reloads the policy before enforcing
  • On each request that changes the policy, we publish a new invalidation timestamp to the Django cache.
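The steps above can be sketched roughly like this. This is a self-contained toy, not the PR's actual code: a plain dict stands in for the shared Django cache, and `EnforcerSketch`/`invalidate` are illustrative names for the roles played by AuthzEnforcer and the policy-change code path.

```python
import time

SHARED_CACHE = {}  # stands in for the cross-process Django cache (e.g. redis)
CACHE_KEY = "authz_policy_invalidation"  # hypothetical key name

class EnforcerSketch:
    """Toy per-process enforcer that lazily reloads its policy."""

    def __init__(self):
        self._loaded_at = None  # timestamp of the last local policy load
        self.reload_count = 0   # for illustration only

    def _load_policy(self):
        # In the real implementation this would re-read the Casbin policies.
        self.reload_count += 1
        self._loaded_at = time.time()

    def enforce(self):
        # Reload only if we never loaded, or the shared invalidation
        # timestamp is newer than our local load time.
        invalidated_at = SHARED_CACHE.get(CACHE_KEY)
        if self._loaded_at is None or (
            invalidated_at is not None and invalidated_at > self._loaded_at
        ):
            self._load_policy()
        # ... actual Casbin enforcement would happen here ...

def invalidate():
    # Called by any process that changes the policy.
    SHARED_CACHE[CACHE_KEY] = time.time()
```

With this shape, an unchanged policy costs one cache read per enforce call instead of a full policy reload.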

Concerns

The invalidation mechanism implemented here works correctly when the Django cache is configured so that all lms and cms processes share the same cache backend, which is the case in the way tutor deploys the system: a single redis instance serves as the Django cache backend.

I'm not sure what other cache configurations are supported by Open edX. In theory, a memcached backend would also work if it's a shared instance, but if the cache is set up so that some instances connect to different, isolated backends, then cache invalidation won't be guaranteed across all processes and services.
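For reference, the shared-backend assumption corresponds to a Django CACHES setting along these lines. The backend class and Redis URL are illustrative, not taken from this PR; the point is only that LMS and CMS point at the same instance.

```python
# Hedged example only: a Django CACHES setting where all LMS and CMS
# processes share one Redis backend, which is what the invalidation
# scheme in this PR assumes.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        # Same instance and DB number for every LMS/CMS process:
        "LOCATION": "redis://redis:6379/1",
    }
}
```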

Merge checklist:
Check off if complete or not applicable:

  • Version bumped
  • Changelog record added
  • Documentation updated (not only docstrings)
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Manual testing instructions provided
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Nov 11, 2025

openedx-webhooks commented Nov 11, 2025

Thanks for the pull request, @rodmgwgu!

This repository is currently maintained by @openedx/committers-openedx-authz.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.


Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

    None
    """
    current_timestamp = time.time()
    cache.set(cls.CACHE_KEY, current_timestamp, None)
Contributor Author (@rodmgwgu):
Not sure if using the Django cache this way would always guarantee invalidation across processes; it may depend on the backend used. For tutor's redis setup, it should at least.

Contributor:
Based on a short conversation with @ormsbee, I think the short answer is that it's possible to configure several different caches, at least across lms/cms, so this may not be a reliable method for invalidation. The most reliable way would be to put it in the database, where it's guaranteed.

Member @mariajgrimaldi (Nov 13, 2025):
Could we have different caches for the same service? #140 (comment)

Contributor Author (@rodmgwgu):
Do you think it would be worthwhile to implement this using the database? It won't be as performant, since we'd be adding a hit to the db, but at least it's a simple query, which seems better than reloading the whole policy on every request.

Or perhaps, make a configuration switch to use either cache or the db?

Contributor:
I think we should use the db and just solve it for good. It should just be a 1-row table that mysql should serve out of cache.
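A rough sketch of that single-row-table idea, using stdlib sqlite3 purely for illustration (the PR later introduces a Django model, PolicyCacheControl, on the real database; the table and helper names below are hypothetical):

```python
import sqlite3
import uuid

# In-memory DB stands in for MySQL; one row holds the invalidation state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy_cache_control "
    "(id INTEGER PRIMARY KEY CHECK (id = 1), version TEXT)"
)

def bump_version(conn):
    """Any policy change writes a fresh random version to the single row."""
    new_version = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO policy_cache_control (id, version) VALUES (1, ?) "
        "ON CONFLICT(id) DO UPDATE SET version = excluded.version",
        (new_version,),
    )
    return new_version

def get_version(conn):
    """Workers read this cheap single-row query instead of reloading policy."""
    row = conn.execute(
        "SELECT version FROM policy_cache_control WHERE id = 1"
    ).fetchone()
    return row[0] if row else None
```

Each enforce call then compares its locally remembered version against `get_version()` and reloads only on mismatch, so the per-request cost is one trivially cacheable query.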

Another commenter:
Minor note: We don't necessarily care about the timestamp, and comparisons with time.time() across multiple machines can potentially introduce intermittent issues with clock skew. We mostly just care about the question of "does the current in-process state match the state of the database", which could be done by setting any random value whenever there's an invalidation and having the workers check their "current state" var with that one.

I do agree with @bmtcril that we'll probably need a more granular caching mechanism before we expand past libraries.
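A minimal sketch of that suggestion: compare by equality against a random token rather than by ordering timestamps, so clock skew between machines cannot matter. All names here are illustrative, not from the PR.

```python
import uuid

SHARED = {}  # stands in for the shared store (cache or DB row)
KEY = "policy_state_token"  # hypothetical key name

def invalidate():
    # Any fresh random value works; no clocks involved.
    SHARED[KEY] = uuid.uuid4().hex

class Worker:
    """Toy per-process worker tracking its in-process state token."""

    def __init__(self):
        self._token = None
        self.reloads = 0

    def maybe_reload(self):
        current = SHARED.get(KEY)
        # Pure equality check: reload whenever our local state no
        # longer matches the shared state, whoever wrote it and whenever.
        if current != self._token:
            self.reloads += 1  # the real code would reload the policy here
            self._token = current
```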

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from cbecf01 to ef70b2c on November 12, 2025 15:36
@rodmgwgu rodmgwgu changed the title WIP: Experimenting with cache invalidation feat: Policy cache invalidation approach Nov 12, 2025
@rodmgwgu rodmgwgu marked this pull request as ready for review November 12, 2025 22:50

mariajgrimaldi commented Nov 13, 2025

Thank you so much for moving this forward! I don’t think we can assume all installations share the same Redis setup, but here’s what we know for the MVP:

  • The enforcer will only be used by LMS and CMS processes. Our current setup uses both LMS endpoints and inline enforcements in the CMS.
  • All CMS processes should share the same cache backend for consistency (right?).
  • All LMS processes should share the same cache backend for consistency.

So we can only guarantee consistency within processes of the same service. It would be even better if we could use a single service, so the admin console could call the CMS where the inline enforcements happen. Right?

I asked our infra team internally, and they don't believe this is a common practice, but better safe than sorry, I guess.

Member @mariajgrimaldi left a comment:

Can we run make format? Thanks! Everything else looks good :)

I'll be testing around in our remote environment :)


bmtcril commented Nov 13, 2025

> I don’t think we can assume all installations share the same Redis setup, but here’s what we know for the MVP: […]

I think the problem here is the invalidation. How do the LMS processes know when the CMS invalidates the cache due to a permissions update?

@MaferMazu MaferMazu linked an issue Nov 13, 2025 that may be closed by this pull request
Member @mariajgrimaldi:

> I think the problem here is the invalidation. How do the LMS processes know when the CMS invalidates the cache due to a permissions update?

Of course. This approach only works if we focus on a single service, which might be fine for the MVP. The inline enforcements happen in the CMS, so the admin could call the APIs there. But this won't hold long term, so a better option could be a singleton model that stores the invalidation data, as you mentioned.

@mphilbrick211 mphilbrick211 added the mao-onboarding Reviewing this will help onboard devs from an Axim mission-aligned organization (MAO). label Nov 13, 2025
@mphilbrick211 mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Nov 13, 2025
@mariajgrimaldi mariajgrimaldi linked an issue Nov 14, 2025 that may be closed by this pull request
    """
    last_modified_timestamp = PolicyCacheControl.get_last_modified_timestamp()

    current_timestamp = time.time()
Member:

Instead of using time.time(), as Dave suggested, we could have a sort of counter to compare local state (cached, local to the process) against global state (db, shared by all processes). If the global invalidation counter is greater than the local invalidation counter, then an invalidation occurred.

Contributor:

I think a counter can introduce (probably very rare) race conditions where two processes increment to the same value and neither of them gets the other's updates. I'd just use something like a GUID for safety.

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from ff0d841 to f50123a on November 14, 2025 22:57
    ********************

    Changed
    =======
Member:

I don't think this was on purpose?


mariajgrimaldi commented Nov 17, 2025

I left this comment earlier:

> I tested this in an empty environment (ran the migrations for the first time and then the load_policies command) and noticed that after the first policy load, nothing triggers an invalidation. So the enforcer stays out of date, and since it starts empty, we can’t add new roles; we just get stuck. It seems like we need a way to force an invalidation on the very first load. I’m not sure if this is only happening to me because of some inconsistency in my setup.

I worked around it by calling AuthzEnforcer.get_enforcer() inside ready(), but that only works as a runtime hack: it breaks the build because the database isn't ready at that point.

Here’s exactly what happened:

[screenshots]

But now I can't reproduce it and it's working as expected. I'll leave the comment here in case it's useful in the future.

Member @mariajgrimaldi left a comment:

LGTM! Thank you so much for moving this forward :)

If we find any issues related to this, I think we can address them in a different PR. Again, thank you so much!


bmtcril commented Nov 18, 2025

This should just need the model docstring annotated to note that it doesn't contain PII:

    .. no_pii:
    """
Contributor @BryanttV left a comment:

@rodmgwgu, thank you very much for this! I tested it locally and it works very well. I just have a few minor comments.

    last_version = PolicyCacheControl.get_version()

    if last_version is None:
        # No timestamp in cache; initialize it
Contributor:

We need to update this comment

Contributor Author (@rodmgwgu):

done, thanks!

Comment on lines 839 to 840:

    def test_load_policy_if_needed_initializes_cache_timestamp(self, mock_toggle):
        """Test that load_policy_if_needed initializes cache timestamp on first call.
Contributor:

I think we also need to update this according to the new UUID approach

Contributor Author (@rodmgwgu):

done, thanks!

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from f50123a to de47a7d on November 18, 2025 16:02
Contributor @BryanttV left a comment:

LGTM!

@mariajgrimaldi mariajgrimaldi merged commit 125894f into openedx:main Nov 18, 2025
14 checks passed
@github-project-automation github-project-automation bot moved this from In Eng Review to Done in Contributions Nov 18, 2025