IMDS retry exponential backoff #568
// For IMDS, use exponential backoff based on attempt number
var waitTime time.Duration
if c.source == DefaultToIMDS {
	// Exponential backoff with base of 1 second: 1s, 2s, 4s, 8s, etc.
This is less backoff than recommended: https://learn.microsoft.com/entra/identity/managed-identities-azure-resources/how-to-use-vm-token#retry-guidance
Updated the wait time to reflect the recommended time.
Still not addressed. The current code starts at 1<<0 = 1s for the first attempt (1s, 2s, 4s, 8s…). The IMDS retry guidance recommends starting at 2s (2^retry), so the series should be 2s, 4s, 8s, 16s…. Also worth noting: there is currently no cap on the wait duration, so at attempt 10 the code would wait 1024s. Recommend something like:
waitTime := time.Duration(2<<uint(attempt)) * time.Second
if waitTime > 60*time.Second {
	waitTime = 60 * time.Second
}

if i < len(tt.mockResponses)-1 {
	mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))
} else {
	mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))
}
Still present in the updated diff. Lines 91–95 of the test have identical branches:
if i < len(tt.mockResponses)-1 {
	mockClient.AppendResponse(..., mock.WithCallback(callback))
} else {
	mockClient.AppendResponse(..., mock.WithCallback(callback)) // identical
}
The condition does nothing: both branches make the same call. It should just be:
mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))

}
select {
case <-time.After(time.Second):
case <-time.After(waitTime):
This makes the tests slow. Consider enabling them to skip the wait, for example by adding a hook like this:
var after = func(d time.Duration) <-chan time.Time {
	return time.After(d)
}
// in a test case
after = func(d time.Duration) <-chan time.Time {
	// TODO: validate d
	ch := make(chan time.Time, 1)
	ch <- time.Now()
	return ch
}
(This has the additional benefit of preventing flakiness due to unpredictable scheduling.)
Updated, and added the after hook.
It feels like we should always have a wrapper around time for testing anything time-related.
Still not addressed. The current diff still has case <-time.After(waitTime): with no overridable hook. The new test cases actually sleep for real (1s + 2s + 4s = 7s for the IMDS exponential backoff case alone) and then validate wall-clock timing with ±500ms tolerance, which is exactly the flakiness concern raised here.




Description
This pull request introduces enhancements to the IMDS (Instance Metadata Service) retry logic in the Microsoft Authentication Library for Go (MSAL Go). The updates aim to improve the reliability and resilience of token acquisition when interacting with Azure IMDS endpoints, particularly under transient network failures or service unavailability scenarios.
Key changes:
These improvements help ensure more consistent authentication experiences for applications running in Azure environments that rely on managed identities.
Please review the changes and provide feedback or suggestions for further improvements.