Problem: a question arose about the algorithm used by job-exec to kill processes, and flux-config-exec(5) wasn't super helpful.
Currently we have:
kill-timeout
(optional) The amount of time to wait after SIGTERM is sent to a
job before sending SIGKILL.
term-signal
(optional) Specify an alternate signal to SIGTERM when terminating
job tasks. Mainly used for testing.
kill-signal
(optional) Specify an alternate signal to SIGKILL when killing
tasks and the job shell. Mainly used for testing.
Some observations
default value for kill-timeout (5s) is missing
max-kill-count is missing entirely (it has a default of 8)
the kill algorithm isn't described
On the last point, maybe we could tack on something like:
JOB TERMINATION
When a job is canceled or reaches its time limit, it is terminated using the following sequence:
The job shells are sent SIGTERM
After kill-timeout, any remaining shells are sent SIGKILL
This continues with an exponential back-off, with kill-timeout doubling after each attempt (capped at 300s)
After max-kill-count attempts, any nodes still running processes are drained
Just enough to help people choose the right values for the configuration parameters.
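To make the timing concrete, here is a rough Python sketch of the schedule described above. It is only an illustration, not job-exec code: `kill_schedule` and its parameter names are made up, the 300s cap is taken from the proposed text, and it assumes that the initial SIGTERM does not count toward max-kill-count, that each SIGKILL counts as one attempt, and that the doubling starts after the first SIGKILL. The real implementation may differ on any of these points.

```python
def kill_schedule(kill_timeout=5.0, max_kill_count=8, cap=300.0):
    """Return (elapsed_seconds, signal) pairs for the assumed kill sequence.

    Hypothetical helper: models SIGTERM at t=0, then max_kill_count
    SIGKILL attempts, with the wait doubling after each attempt and
    capped at `cap` seconds.
    """
    events = [(0.0, "SIGTERM")]
    elapsed = 0.0
    timeout = kill_timeout
    for _ in range(max_kill_count):
        elapsed += timeout                # wait out the current timeout
        events.append((elapsed, "SIGKILL"))
        timeout = min(timeout * 2, cap)   # exponential back-off, capped
    return events


if __name__ == "__main__":
    for when, signame in kill_schedule():
        print(f"{when:7.1f}s  {signame}")
```

Under those assumptions, the defaults (kill-timeout of 5s, max-kill-count of 8) put the first SIGKILL at 5s and the last at 915s, so max-kill-count has a much larger effect on how long a stuck job lingers than kill-timeout does.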