K8s client does not retry some 5xx errors #5829

Open · bentsherman opened this issue Feb 28, 2025 · 1 comment

@bentsherman (Member) commented:
If it's returning a 500 error code then Nextflow should be retrying it:

nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/client/K8sClient.groovy (lines 629 to 639 in 0aa76af):
```groovy
try {
    return makeRequestCall( method, path, body )
}
catch ( K8sResponseException | SocketException | SocketTimeoutException e ) {
    // only HTTP 500 responses are retried -- any other 5xx code is re-thrown
    if ( e instanceof K8sResponseException && e.response.code != 500 )
        throw e
    if ( ++attempt > maxRetries )
        throw e
    log.debug "[K8s] API request threw socket exception: $e.message for $method $path - Retrying request (attempt=$attempt)"
    // exponential backoff: 250 ms, 750 ms, 2250 ms, ...
    final long delay = (Math.pow(3, attempt - 1) as long) * 250
    sleep( delay )
}
```

I wonder if it is retrying, but the tunnel disconnect lasts longer than the duration of the 8 retries. That seems plausible if it is a regular outage. Can you see the "API request threw socket exception..." messages in your log?
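For context, the worst case those retries can cover is easy to bound: with the 8 retries mentioned above and the backoff formula from the snippet, the sleeps alone sum to about 13.7 minutes (a sketch; per-request time is not counted):

```groovy
// Sketch: total backoff sleep across 8 attempts, using the formula above
long total = (1..8).sum { attempt -> (Math.pow(3, attempt - 1) as long) * 250 }
println "backoff sleeps total ${total} ms"   // 820000 ms, i.e. ~13.7 minutes
```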

@bentsherman, apologies if this should be raised as a separate question/issue, but this conversation seems closely related to an issue I am having.

I am running in a cluster that sometimes returns temporary 503 errors under heavy load. This causes Nextflow to terminate, which can be seen in the logs:

`Request GET /api...returned an error code=503...`

It seems that similar codes have been made retryable for AWS Batch in #4709.

I am using Nextflow 24.04.3, and when the 503 occurs and the workflow exits, the running processes are not terminated. Testing with 24.12.0-edge, which includes fix #5561, resolves that: running processes are terminated even if a 503 occurs.

Is there a reason why 503 is not also treated as a retryable error code, as it is for AWS Batch?

Is this something that could be added, or could there be a K8s setting that lets the user specify which errors to treat as retryable, e.g. `k8s.retryable_errors = [500, 502, 503]`?
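For illustration, a minimal sketch of how such a setting could drive the retry check. Everything here is hypothetical: `k8s.retryable_errors` is not an existing Nextflow option, and the class and method names are made up.

```groovy
// Hypothetical sketch: a user-configurable list of retryable HTTP codes,
// defaulting to the current hard-coded behaviour (500 only).
class K8sRetrySettings {
    List<Integer> retryableErrors = [500]   // e.g. populated from k8s.retryable_errors
}

boolean isRetryable( int code, K8sRetrySettings settings ) {
    return code in settings.retryableErrors
}

// usage: with k8s.retryable_errors = [500, 502, 503] in the config
def settings = new K8sRetrySettings(retryableErrors: [500, 502, 503])
assert isRetryable(503, settings)
assert !isRetryable(501, settings)
```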

Originally posted by @crossthet in #5604

@bentsherman (Member, Author) commented:

@jorgee when you have some time, can you update the K8s client to handle the same 5xx errors as AWS Batch? It would also be a good chance to use the same Failsafe RetryPolicy pattern here that we use throughout the codebase.
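For reference, a rough sketch of that pattern applied here, using the Failsafe library (`dev.failsafe`). The retryable-code set, backoff bounds, and max retries are illustrative, not the actual implementation, and `K8sResponseException`, `makeRequestCall`, `log`, `method`, `path`, and `body` are the names from the snippet above, assumed to be in scope:

```groovy
import java.time.temporal.ChronoUnit

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy

// illustrative set of retryable status codes, mirroring the AWS Batch change
final Set RETRYABLE_CODES = [500, 502, 503, 504] as Set

def retryPolicy = RetryPolicy.builder()
        .handleIf { e ->
            (e instanceof K8sResponseException && e.response.code in RETRYABLE_CODES) ||
                e instanceof SocketException ||
                e instanceof SocketTimeoutException
        }
        .withBackoff(250, 10_000, ChronoUnit.MILLIS)    // illustrative bounds
        .withMaxRetries(8)
        .onRetry { event ->
            log.debug "[K8s] API request failed - retrying (attempt=${event.attemptCount})"
        }
        .build()

// run the request through the retry policy
def resp = Failsafe.with(retryPolicy).get { makeRequestCall(method, path, body) }
```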

jorgee self-assigned this on Mar 3, 2025