Conversation


@rodmgwgu rodmgwgu commented Nov 11, 2025

Related issue: #133

This PR intends to solve the issue of having to hard-load all Casbin permissions on every request by implementing a simple per-process caching strategy with global invalidation support.

This means that even when we have multiple copies of the Casbin policies loaded in multiple processes (uwsgi workers on a single lms or cms container instance, the lms and cms instances themselves, or multiples of these on a Kubernetes cluster), we can always be sure the latest policies are loaded in memory: they are reloaded only when needed, based on an invalidation timestamp checked in a cross-process cache.

Note: This PR also disables the auto-reload functionality by default (by setting CASBIN_AUTO_LOAD_POLICY_INTERVAL to 0 in the defaults), as it should no longer be needed.

Approach: Handle cache invalidation via Django cache with a timestamp

It works like this (thinking of how it would work on a standard tutor prod setup):

  • Each UWSGI worker process has its own instance of the AuthzEnforcer
  • AuthzEnforcer keeps a timestamp of the last time it loaded the policy
  • On each request to enforce, AuthzEnforcer checks the last invalidation timestamp in the Django cache (usually backed by redis) and compares it to its internal one. If the cached timestamp is newer, it reloads the policy before enforcing
  • On each request that changes the policy, we publish a new invalidation timestamp to the Django cache.
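The steps above can be sketched roughly like this. This is a self-contained toy, not the PR's actual code: a plain dict stands in for the shared Django cache, and `EnforcerSketch`/`invalidate` are illustrative names for the roles played by AuthzEnforcer and the policy-change code path.

```python
import time

SHARED_CACHE = {}  # stands in for the cross-process Django cache (e.g. redis)
CACHE_KEY = "authz_policy_invalidation"  # hypothetical key name

class EnforcerSketch:
    """Toy per-process enforcer that lazily reloads its policy."""

    def __init__(self):
        self._loaded_at = None  # timestamp of the last local policy load
        self.reload_count = 0   # for illustration only

    def _load_policy(self):
        # In the real implementation this would re-read the Casbin policies.
        self.reload_count += 1
        self._loaded_at = time.time()

    def enforce(self):
        # Reload only if we never loaded, or the shared invalidation
        # timestamp is newer than our local load time.
        invalidated_at = SHARED_CACHE.get(CACHE_KEY)
        if self._loaded_at is None or (
            invalidated_at is not None and invalidated_at > self._loaded_at
        ):
            self._load_policy()
        # ... actual Casbin enforcement would happen here ...

def invalidate():
    # Called by any process that changes the policy.
    SHARED_CACHE[CACHE_KEY] = time.time()
```

With this shape, an unchanged policy costs one cache read per enforce call instead of a full policy reload.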

Concerns

The invalidation mechanism implemented here works correctly when the Django cache is configured so that all lms and cms processes share the same cache backend, which is the case in the way tutor deploys the system: a single redis instance serves as the Django cache backend.

I'm not sure what other cache configurations are supported by Open edX. In theory, a memcached backend would also work if it's a shared instance, but if the cache is set up so that some instances connect to different, isolated backends, then cache invalidation won't be guaranteed across all processes and services.
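For reference, the shared-backend assumption corresponds to a Django CACHES setting along these lines. The backend class and Redis URL are illustrative, not taken from this PR; the point is only that LMS and CMS point at the same instance.

```python
# Hedged example only: a Django CACHES setting where all LMS and CMS
# processes share one Redis backend, which is what the invalidation
# scheme in this PR assumes.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        # Same instance and DB number for every LMS/CMS process:
        "LOCATION": "redis://redis:6379/1",
    }
}
```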

Merge checklist:
Check off if complete or not applicable:

  • Version bumped
  • Changelog record added
  • Documentation updated (not only docstrings)
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Manual testing instructions provided
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Nov 11, 2025

openedx-webhooks commented Nov 11, 2025

Thanks for the pull request, @rodmgwgu!

This repository is currently maintained by @openedx/committers-openedx-authz.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.


Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

    None
    """
    current_timestamp = time.time()
    cache.set(cls.CACHE_KEY, current_timestamp, None)
Contributor Author (@rodmgwgu):
Not sure if using the Django cache this way would always guarantee invalidation across processes; it may depend on the backend used. For tutor's redis setup, it should at least.

Contributor:
Based on a short conversation with @ormsbee, I think the short answer is that it's possible to configure several different caches, at least across lms/cms, so this may not be a reliable method for invalidation. The most reliable way would be to put it in the database, where it's guaranteed.

Member @mariajgrimaldi (Nov 13, 2025):
Could we have different caches for the same service? #140 (comment)

Contributor Author (@rodmgwgu):
Do you think it would be worthwhile to implement this using the database? It won't be as performant, since we'd be adding a hit to the db, but at least it's a simple query, which seems better than reloading the whole policy on every request.

Or perhaps, make a configuration switch to use either cache or the db?

Contributor:
I think we should use the db and just solve it for good. It should just be a 1-row table that mysql should serve out of cache.
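A rough sketch of that single-row-table idea, using stdlib sqlite3 purely for illustration (the PR later introduces a Django model, PolicyCacheControl, on the real database; the table and helper names below are hypothetical):

```python
import sqlite3
import uuid

# In-memory DB stands in for MySQL; one row holds the invalidation state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy_cache_control "
    "(id INTEGER PRIMARY KEY CHECK (id = 1), version TEXT)"
)

def bump_version(conn):
    """Any policy change writes a fresh random version to the single row."""
    new_version = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO policy_cache_control (id, version) VALUES (1, ?) "
        "ON CONFLICT(id) DO UPDATE SET version = excluded.version",
        (new_version,),
    )
    return new_version

def get_version(conn):
    """Workers read this cheap single-row query instead of reloading policy."""
    row = conn.execute(
        "SELECT version FROM policy_cache_control WHERE id = 1"
    ).fetchone()
    return row[0] if row else None
```

Each enforce call then compares its locally remembered version against `get_version()` and reloads only on mismatch, so the per-request cost is one trivially cacheable query.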

Another commenter:
Minor note: We don't necessarily care about the timestamp, and comparisons with time.time() across multiple machines can potentially introduce intermittent issues with clock skew. We mostly just care about the question of "does the current in-process state match the state of the database", which could be done by setting any random value whenever there's an invalidation and having the workers check their "current state" var with that one.

I do agree with @bmtcril that we'll probably need a more granular caching mechanism before we expand past libraries.
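A minimal sketch of that suggestion: compare by equality against a random token rather than by ordering timestamps, so clock skew between machines cannot matter. All names here are illustrative, not from the PR.

```python
import uuid

SHARED = {}  # stands in for the shared store (cache or DB row)
KEY = "policy_state_token"  # hypothetical key name

def invalidate():
    # Any fresh random value works; no clocks involved.
    SHARED[KEY] = uuid.uuid4().hex

class Worker:
    """Toy per-process worker tracking its in-process state token."""

    def __init__(self):
        self._token = None
        self.reloads = 0

    def maybe_reload(self):
        current = SHARED.get(KEY)
        # Pure equality check: reload whenever our local state no
        # longer matches the shared state, whoever wrote it and whenever.
        if current != self._token:
            self.reloads += 1  # the real code would reload the policy here
            self._token = current
```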

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from cbecf01 to ef70b2c on November 12, 2025 15:36
@rodmgwgu rodmgwgu changed the title WIP: Experimenting with cache invalidation feat: Policy cache invalidation approach Nov 12, 2025
@rodmgwgu rodmgwgu marked this pull request as ready for review November 12, 2025 22:50

mariajgrimaldi commented Nov 13, 2025

Thank you so much for moving this forward! I don’t think we can assume all installations share the same Redis setup, but here’s what we know for the MVP:

  • The enforcer will only be used by LMS and CMS processes. Our current setup uses both LMS endpoints and inline enforcements in the CMS.
  • All CMS processes should share the same cache backend for consistency (right?).
  • All LMS processes should share the same cache backend for consistency.

So we can only guarantee consistency within processes of the same service. It would be even better if we could use a single service, so the admin console could call the CMS where the inline enforcements happen. Right?

I asked our infra team internally, and they don't believe this is a common practice, but better safe than sorry, I guess.

Member @mariajgrimaldi left a comment:

Can we run make format? Thanks! Everything else looks good :)

I'll be testing around in our remote environment :)


bmtcril commented Nov 13, 2025

> I don’t think we can assume all installations share the same Redis setup, but here’s what we know for the MVP: […]

I think the problem here is the invalidation. How do the LMS processes know when the CMS invalidates the cache due to a permissions update?

@MaferMazu MaferMazu linked an issue Nov 13, 2025 that may be closed by this pull request
Member @mariajgrimaldi:

> I think the problem here is the invalidation. How do the LMS processes know when the CMS invalidates the cache due to a permissions update?

Of course. This approach only works if we focus on a single service, which might be fine for the MVP. The inline enforcements happen in the CMS, so the admin could call the APIs there. But this won't hold long term, so a better option could be a singleton model that stores the invalidation data, as you mentioned.

@mphilbrick211 mphilbrick211 added the mao-onboarding Reviewing this will help onboard devs from an Axim mission-aligned organization (MAO). label Nov 13, 2025
@mphilbrick211 mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Nov 13, 2025
@mariajgrimaldi mariajgrimaldi linked an issue Nov 14, 2025 that may be closed by this pull request
    """
    last_modified_timestamp = PolicyCacheControl.get_last_modified_timestamp()

    current_timestamp = time.time()
Member:

Instead of using time.time(), as Dave suggested, we could have a sort of counter to compare local state (cached, local to the process) against global state (db, shared by all processes). If the global invalidation counter is greater than the local invalidation counter, then an invalidation occurred.

Contributor:

I think a counter can introduce (probably very rare) race conditions where two processes increment to the same value and neither of them gets the other's updates. I'd just use something like a GUID for safety.

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from ff0d841 to f50123a on November 14, 2025 22:57
    ********************

    Changed
    =======
Member:

I don't think this was on purpose?


mariajgrimaldi commented Nov 17, 2025

I left this comment earlier:

> I tested this in an empty environment (ran the migrations for the first time and then the load_policies command) and noticed that after the first policy load, nothing triggers an invalidation. So the enforcer stays out of date, and since it starts empty, we can’t add new roles; we just get stuck. It seems like we need a way to force an invalidation on the very first load. I’m not sure if this is only happening to me because of some inconsistency in my setup.

I worked around it by calling AuthzEnforcer.get_enforcer() inside ready(), but that only works as a runtime hack: it breaks the build because the database isn't ready at that point.

Here’s exactly what happened:

[screenshots]

But now I can't reproduce it and it's working as expected. I'll leave the comment here in case it's useful in the future.

Member @mariajgrimaldi left a comment:

LGTM! Thank you so much for moving this forward :)

If we find any issues related to this, I think we can address them in a different PR. Again, thank you so much!


bmtcril commented Nov 18, 2025

This should just need the model docstring annotated to note that it doesn't contain PII:

    .. no_pii:
    """
Contributor @BryanttV left a comment:

@rodmgwgu, thank you very much for this! I tested it locally and it works very well. I just have a few minor comments.

    last_version = PolicyCacheControl.get_version()

    if last_version is None:
        # No timestamp in cache; initialize it
Contributor:

We need to update this comment

Contributor Author (@rodmgwgu):

done, thanks!

Comment on lines 839 to 840:

    def test_load_policy_if_needed_initializes_cache_timestamp(self, mock_toggle):
        """Test that load_policy_if_needed initializes cache timestamp on first call.
Contributor:

I think we also need to update this according to the new UUID approach

Contributor Author (@rodmgwgu):

done, thanks!

@rodmgwgu rodmgwgu force-pushed the rod/cache_experiment branch from f50123a to de47a7d on November 18, 2025 16:02
Contributor @BryanttV left a comment:

LGTM!

@mariajgrimaldi mariajgrimaldi merged commit 125894f into openedx:main Nov 18, 2025
14 checks passed
@github-project-automation github-project-automation bot moved this from In Eng Review to Done in Contributions Nov 18, 2025