Skip to content

Fix concurrent cache access in AcquireToken method#578

Open
4gust wants to merge 3 commits into
mainfrom
4gust/concurrent-cache-failure
Open

Fix concurrent cache access in AcquireToken method#578
4gust wants to merge 3 commits into
mainfrom
4gust/concurrent-cache-failure

Conversation

@4gust
Copy link
Copy Markdown
Collaborator

@4gust 4gust commented Aug 28, 2025

PR: Fix Concurrent Cache Access in AcquireToken Method

Summary

This PR addresses a potential race condition in the AcquireToken method by properly synchronizing access to the cache using a mutex. Issue number #569


Changes

  • Introduced cacheAccessorMu.Lock() / defer cacheAccessorMu.Unlock() around cache operations to ensure thread-safe access.
  • Removed the canRefresh atomic variable, which was previously used to prevent concurrent refreshes which is no longer needed as whole cache is in mutex

Testing

  • Validated using the existing TestAcquireTokenConcurrency unit test to ensure correct behavior under concurrent conditions.

Breaking Changes

  • None. This is an internal, non-breaking implementation change that does not affect the public API.

@4gust 4gust requested review from bgavrilMS and rayluo as code owners August 28, 2025 14:49
@4gust 4gust requested a review from chlowell August 28, 2025 14:49
@sonarqubecloud
Copy link
Copy Markdown

httpClient: shared.DefaultClient,
retryPolicyEnabled: true,
source: source,
canRefresh: &zero,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove the declaration of zero

Comment on lines +1297 to +1299
for err := range errors {
t.Error(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this approach instead. If there were 100 identical errors, would you want to see all of them in go test output?


// Mock client should only need to respond once if caching works correctly
mockClient := mock.NewClient()
// Add multiple responses in case caching fails (but we'll verify it doesn't)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you simply appended 1 response the mock would panic when it gets a second request and you wouldn't have to count the requests yourself

return
}

// Capture the token received
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? There can only be one token because the mock client returns a static value, and you don't need this to know whether the goroutine succeeded because if it didn't it would have written an error to the channel

Comment thread apps/managedidentity/managedidentity.go Outdated
c.authParams.Scopes = []string{resource}

cacheAccessorMu.Lock()
defer cacheAccessorMu.Unlock()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock is held for the entire duration of AcquireToken, including the c.getToken() HTTP call on the refresh path. This means all goroutines calling AcquireToken are fully serialized globally — if IMDS is slow or times out (up to 30s), every other goroutine blocks.

The original canRefresh CAS was more surgical: it only gated concurrent refreshes; cache hits were still concurrent. Consider holding the lock only around cache reads and writes, not across the network call:

cacheAccessorMu.RLock()
stResp, err := cacheManager.Read(...)
ar, err := base.AuthResultFromStorage(stResp)
cacheAccessorMu.RUnlock()

if needsRefresh {
    tr, er := c.getToken(ctx, resource) // no lock held during HTTP
    if er == nil {
        // re-acquire write lock to update cache
        return tr, nil
    }
}

Comment thread apps/managedidentity/managedidentity.go Outdated

// cache never uses the client because instance discovery is always disabled.
var cacheManager *storage.Manager = storage.New(nil)
var cacheAccessorMu *sync.RWMutex = &sync.RWMutex{}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cacheAccessorMu is declared as *sync.RWMutex but Lock() (exclusive write lock) is always called — RLock() is never used. This gives no benefit over a plain sync.Mutex and adds confusion. Either switch to sync.Mutex:

var cacheAccessorMu sync.Mutex

or use RLock/RUnlock for the read-only cache lookup path and reserve Lock only for writes (which also addresses the HTTP-hold concern above).

@Robbie-Microsoft
Copy link
Copy Markdown

Design note: global mutex serializes across all Client instances

Both cacheManager and the new cacheAccessorMu are package-level globals, so all Client instances — regardless of resource or identity — contend on the same lock. A caller that creates two clients for different resources will have their token acquisitions serialized against each other even though they have no shared state at the application level.

This is a pre-existing consequence of cacheManager being global, but the new mutex makes it more visible. It may be worth considering a per-client sync.Mutex paired with a per-client cache, or at minimum documenting the serialization behavior so callers know concurrent multi-resource clients don't actually run concurrently.

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Concurrent managedidentity.Client.AcquireToken() calls cause multiple token requests to identity provider before caching

4 participants