Skip to content

IMDS retry exponential backoff #568

Open
4gust wants to merge 3 commits into
mainfrom
4gust/imds-retry-update
Open

IMDS retry exponential backoff #568
4gust wants to merge 3 commits into
mainfrom
4gust/imds-retry-update

Conversation

@4gust
Copy link
Copy Markdown
Collaborator

@4gust 4gust commented May 19, 2025

Description

This pull request introduces enhancements to the IMDS (Instance Metadata Service) retry logic in the Microsoft Authentication Library for Go (MSAL Go). The updates aim to improve the reliability and resilience of token acquisition when interacting with Azure IMDS endpoints, particularly under transient network failures or service unavailability scenarios.

Key changes:

  • Adjusted backoff intervals and retry limits for improved fault tolerance.
  • Added or updated corresponding unit tests to ensure correct retry behavior.

These improvements help ensure more consistent authentication experiences for applications running in Azure environments that rely on managed identities.


Please review the changes and provide feedback or suggestions for further improvements.

@4gust 4gust requested review from bgavrilMS and rayluo as code owners May 19, 2025 13:56
@4gust 4gust requested review from AndyOHart and chlowell and removed request for AndyOHart May 19, 2025 13:56
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

// For IMDS, use exponential backoff based on attempt number
var waitTime time.Duration
if c.source == DefaultToIMDS {
// Exponential backoff with base of 1 second: 1s, 2s, 4s, 8s, etc.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the time to reflect the time.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not addressed. The current code starts at 1<<0 = 1s for the first attempt (1s, 2s, 4s, 8s…). The IMDS retry guidance recommends starting at 2s (2^retry), so the series should be 2s, 4s, 8s, 16s…. Also worth noting: there's currently no cap on the wait duration — at attempt 10 the code would wait 1024s. Recommend something like:

waitTime := time.Duration(2<<uint(attempt)) * time.Second
if waitTime > 60*time.Second {
    waitTime = 60 * time.Second
}

Comment on lines +281 to +285
if i < len(tt.mockResponses)-1 {
mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))
} else {
mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if A {
    B
} else {
    B
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still present in the updated diff. Lines 91–95 of the test have identical branches:

if i < len(tt.mockResponses)-1 {
    mockClient.AppendResponse(..., mock.WithCallback(callback))
} else {
    mockClient.AppendResponse(..., mock.WithCallback(callback)) // identical
}

The condition does nothing — both branches call the same thing. Should just be:

mockClient.AppendResponse(mock.WithBody(body.Bytes()), mock.WithHTTPStatusCode(resp.statusCode), mock.WithCallback(callback))

}
select {
case <-time.After(time.Second):
case <-time.After(waitTime):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the tests slow. Consider enabling them to skip the wait, for example by adding a hook like this:

var after = func(d time.Duration) <-chan time.Time {
	return time.After(d)
}

// in a test case
after = func(d time.Duration) <-chan time.Time {
    // TODO: validate d
    ch := make(chan time.Time, 1)
    ch <- time.Now()
    return ch
}

(has the additional benefit of preventing flakiness due to unpredictable scheduling)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, and added after.
This feels like we should always have a wrapper for time for testing all the time related things.

Copy link
Copy Markdown

@Robbie-Microsoft Robbie-Microsoft Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not addressed. The current diff still has case <-time.After(waitTime): with no overrideable hook. The new test cases actually sleep for real (1s + 2s + 4s = 7s for the IMDS exponential backoff case alone) and then validate wall-clock timing with ±500ms tolerance — exactly the flakiness concern raised here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants