K8s client does not retry some 5xx errors #5829

Open · bentsherman opened this issue Feb 28, 2025 · 1 comment

@bentsherman (Member) commented:
If it's returning a 500 error code then Nextflow should be retrying it:

nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/client/K8sClient.groovy (lines 629 to 639 in 0aa76af):
```groovy
try {
    return makeRequestCall( method, path, body )
}
catch ( K8sResponseException | SocketException | SocketTimeoutException e ) {
    // only HTTP 500 responses are retried -- any other 5xx code is re-thrown
    if ( e instanceof K8sResponseException && e.response.code != 500 )
        throw e
    if ( ++attempt > maxRetries )
        throw e
    log.debug "[K8s] API request threw socket exception: $e.message for $method $path - Retrying request (attempt=$attempt)"
    // exponential backoff: 250 ms, 750 ms, 2250 ms, ...
    final long delay = (Math.pow(3, attempt - 1) as long) * 250
    sleep( delay )
}
```

I wonder if it is retrying, but the tunnel disconnect lasts longer than the duration of the 8 retries. That seems plausible if it is a regular outage. Can you see the "API request threw socket exception..." messages in your log?
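For context, the worst case those retries can cover is easy to bound: with the 8 retries mentioned above and the backoff formula from the snippet, the sleeps alone sum to about 13.7 minutes (a sketch; per-request time is not counted):

```groovy
// Sketch: total backoff sleep across 8 attempts, using the formula above
long total = (1..8).sum { attempt -> (Math.pow(3, attempt - 1) as long) * 250 }
println "backoff sleeps total ${total} ms"   // 820000 ms, i.e. ~13.7 minutes
```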

@bentsherman, apologies if this should be raised as a separate question/issue, but this conversation seems closely related to an issue I am having.

I am running in a cluster that sometimes returns temporary 503 errors under heavy load. This causes Nextflow to terminate, which can be seen in the logs:

`Request GET /api...returned an error code=503...`

It seems that similar codes have been made retryable for AWS Batch in #4709.

I am using Nextflow 24.04.3, and when the 503 occurs and the workflow exits, the running processes are not terminated. Testing with 24.12.0-edge, which includes fix #5561, resolves that: running processes are terminated even if a 503 occurs.

Is there a reason why 503 is not also treated as a retryable error code, as it is for AWS Batch?

Is this something that could be added, or could there be a K8s setting that lets the user specify which errors to treat as retryable, e.g. `k8s.retryable_errors = [500, 502, 503]`?
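For illustration, a minimal sketch of how such a setting could drive the retry check. Everything here is hypothetical: `k8s.retryable_errors` is not an existing Nextflow option, and the class and method names are made up.

```groovy
// Hypothetical sketch: a user-configurable list of retryable HTTP codes,
// defaulting to the current hard-coded behaviour (500 only).
class K8sRetrySettings {
    List<Integer> retryableErrors = [500]   // e.g. populated from k8s.retryable_errors
}

boolean isRetryable( int code, K8sRetrySettings settings ) {
    return code in settings.retryableErrors
}

// usage: with k8s.retryable_errors = [500, 502, 503] in the config
def settings = new K8sRetrySettings(retryableErrors: [500, 502, 503])
assert isRetryable(503, settings)
assert !isRetryable(501, settings)
```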

Originally posted by @crossthet in #5604

@bentsherman (Member, Author) commented:

@jorgee when you have some time, can you update the K8s client to handle the same 5xx errors as AWS Batch? It would also be a good chance to use the same Failsafe RetryPolicy pattern here that we use throughout the codebase.
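For reference, a rough sketch of that pattern applied here, using the Failsafe library (`dev.failsafe`). The retryable-code set, backoff bounds, and max retries are illustrative, not the actual implementation, and `K8sResponseException`, `makeRequestCall`, `log`, `method`, `path`, and `body` are the names from the snippet above, assumed to be in scope:

```groovy
import java.time.temporal.ChronoUnit

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy

// illustrative set of retryable status codes, mirroring the AWS Batch change
final Set RETRYABLE_CODES = [500, 502, 503, 504] as Set

def retryPolicy = RetryPolicy.builder()
        .handleIf { e ->
            (e instanceof K8sResponseException && e.response.code in RETRYABLE_CODES) ||
                e instanceof SocketException ||
                e instanceof SocketTimeoutException
        }
        .withBackoff(250, 10_000, ChronoUnit.MILLIS)    // illustrative bounds
        .withMaxRetries(8)
        .onRetry { event ->
            log.debug "[K8s] API request failed - retrying (attempt=${event.attemptCount})"
        }
        .build()

// run the request through the retry policy
def resp = Failsafe.with(retryPolicy).get { makeRequestCall(method, path, body) }
```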

jorgee self-assigned this on Mar 3, 2025