Jitter that guarantees that workers do not restart simultaneously #3427

@terezbw

Hi everyone,
We use Gunicorn with the following main arguments:
max-requests=10240
max-requests-jitter=512
workers=4
threads=25

This process runs in k8s.
We experience very short load spikes. Sometimes in the pod logs, we see all four worker processes restarting almost simultaneously. For example:

2025-10-09T17:09:51.472541603Z [2025-10-09 17:09:51 +0000] [28730] [INFO] Booting worker with pid: 28730
2025-10-09T17:09:54.403604539Z [2025-10-09 17:09:54 +0000] [28744] [INFO] Booting worker with pid: 28744
2025-10-09T17:09:54.923136603Z [2025-10-09 17:09:54 +0000] [28752] [INFO] Booting worker with pid: 28752
2025-10-09T17:09:55.019105033Z [2025-10-09 17:09:55 +0000] [28757] [INFO] Booting worker with pid: 28757

The container instantly loses all or almost all of its capacity to serve requests and fails the k8s checks.
Because of mistakes in our check configuration, the checks fire at very short intervals. For example, just as the readiness check decides to bring the endpoints back online, the liveness check decides to send a SIGTERM, and we get a 502 error in Kong. It's clear why this happens and how to fix it: we'll change the sensitivity of the k8s checks. The logs show the SIGTERM arriving:

2025-10-09T17:10:03.890181347Z [2025-10-09 17:10:03 +0000] [1] [INFO] Handling signal: term
2025-10-09T17:10:03.930603482Z 10.244.8.144 - - [09/Oct/2025:17:10:03 +0000] "GET /XXXXXXXXXXXXXXXX HTTP/1.1" 200 5993 "-" "AGENT1"
2025-10-09T17:10:04.002456978Z 10.244.16.11 - - [09/Oct/2025:17:10:04 +0000] "GET /XXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 6053 "-" "AGENT2"
2025-10-09T17:10:04.891834428Z [2025-10-09 17:10:04 +0000] [1] [INFO] Shutting down: Master

But this is a consequence; I want to ask how to fix the cause.
Does the jitter code guarantee that all worker processes will have different restart times each time? I assume we're dealing with the random() function:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/base.py#L58
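
To put a rough number on this question, here is a minimal simulation sketch. It assumes each worker draws its jitter independently and uniformly with `randint(0, max_requests_jitter)`, which is how the linked code appears to work; the function name `close_limit_probability` and the `window` parameter are hypothetical and exist only for this illustration. It estimates how often at least two of the workers end up with request limits within `window` requests of each other.

```python
import random


def close_limit_probability(workers=4, max_requests=10240, jitter=512,
                            window=50, trials=20_000, seed=0):
    """Estimate P(at least two workers get limits within `window` requests)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: one independent uniform draw per worker.
        limits = sorted(max_requests + rng.randint(0, jitter)
                        for _ in range(workers))
        # Smallest gap between neighbouring limits after sorting.
        min_gap = min(b - a for a, b in zip(limits, limits[1:]))
        if min_gap <= window:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(close_limit_probability())
```

Under that assumption nothing forces the draws apart, so the estimate is never zero for a non-negative `window`: independent jitter lowers the chance of close limits but cannot rule them out.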

Can anyone suggest a way to address this?
Should we increase max-requests-jitter and max-requests? That would only reduce the probability of this behavior.
We're already using our own version of worker.py and could add a fair jitter here. But is this even possible?
Does a Gunicorn worker have a numeric ID that numbers the worker?
For example, with workers=4, would the ID be in the range 0-3?
We could then calculate max_requests as follows:
base_max_requests + worker_id * CONST
where CONST is the desired interval between restarts.
Then worker0 would get max_requests = 10240,
worker1 would get 10240 + CONST, and so on.
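
A minimal sketch of that calculation, for illustration only. The function `staggered_max_requests` and its `worker_id` argument are hypothetical: it assumes a stable 0-based worker index is available, which is exactly the open question above, and `step` plays the role of CONST.

```python
def staggered_max_requests(base_max_requests, worker_id, step):
    """Return base_max_requests + worker_id * step for a 0-based worker_id."""
    if worker_id < 0:
        raise ValueError("worker_id must be non-negative")
    return base_max_requests + worker_id * step


if __name__ == "__main__":
    # With workers=4 and step=500 the limits are 10240, 10740, 11240, 11740.
    for worker_id in range(4):
        print(worker_id, staggered_max_requests(10240, worker_id, 500))
```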
But this staggers the workers by request count, not by time. What if one worker is slower than another and finishes 10240 requests at the same moment a faster worker finishes 10240+500? We could again get a simultaneous restart of most of the workers.
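
A small worked example of that concern, under the simplifying assumption that each worker handles requests at a constant rate. The rates of 100 and 105 requests per second are made-up numbers, and `seconds_until_restart` is a hypothetical helper, not part of Gunicorn.

```python
def seconds_until_restart(max_requests, requests_per_second):
    """Time for a worker to reach its limit at a constant request rate."""
    return max_requests / requests_per_second


if __name__ == "__main__":
    # Hypothetical rates: the second worker is 5% faster than the first.
    slow = seconds_until_restart(10240, 100.0)        # 102.4 s
    fast = seconds_until_restart(10240 + 500, 105.0)  # about 102.3 s
    print(f"slow: {slow:.1f}s, fast: {fast:.1f}s, gap: {abs(slow - fast):.2f}s")
```

With these numbers a 500-request offset shrinks to a gap of roughly a tenth of a second, so an offset in requests does not translate into a fixed offset in time.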
