flux-config-exec(5) does not contain much detail about the kill timer #6606

Closed
garlick opened this issue Feb 5, 2025 · 4 comments · Fixed by #6640

Comments

@garlick
Member

garlick commented Feb 5, 2025

Problem: a question arose about the algorithm used by job-exec to kill processes, and flux-config-exec(5) wasn't super helpful.

Currently we have:

      kill-timeout
              (optional) The amount of time to wait after SIGTERM is sent to a
              job before sending SIGKILL.
      term-signal
              (optional) Specify an alternate signal to SIGTERM when terminat‐
              ing job tasks. Mainly used for testing.
      kill-signal
              (optional) Specify an alternate signal to SIGKILL  when  killing
              tasks and the job shell. Mainly used for testing.

Some observations

  • default value for kill-timeout (5s) is missing
  • max-kill-count is missing entirely (it has a default of 8)
  • the kill algorithm isn't described

On the last point, maybe we could tack on something like:

JOB TERMINATION

When a job is canceled or reaches its time limit, it is terminated using the following sequence:

  • The job shells are sent SIGTERM
  • After kill-timeout, any remaining shells are sent SIGKILL
  • This continues with an exponential back-off, with kill-timeout doubling after each attempt (capped at 300s)
  • After max-kill-count attempts, any nodes still running processes are drained

Just enough to help people choose the right values for the configuration parameters.
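
To make the timing concrete, here is a minimal Python sketch of the schedule described above. It is not the job-exec implementation: the function name `kill_schedule` is hypothetical, and it assumes that the first SIGKILL is sent kill-timeout seconds after SIGTERM, that the wait doubles after every SIGKILL attempt, and that max-kill-count counts SIGKILL attempts only.

```python
def kill_schedule(kill_timeout=5.0, max_kill_count=8, cap=300.0):
    """Return the times (seconds after SIGTERM) of each SIGKILL attempt.

    Assumptions (not confirmed by the man page): the first SIGKILL is sent
    after kill_timeout, the wait doubles after each attempt, and each wait
    after the first is capped at `cap` seconds.
    """
    times = []
    elapsed = 0.0
    wait = kill_timeout
    for _ in range(max_kill_count):
        elapsed += wait              # wait, then send SIGKILL
        times.append(elapsed)
        wait = min(wait * 2, cap)    # exponential back-off with a ceiling
    return times


if __name__ == "__main__":
    for attempt, t in enumerate(kill_schedule(), start=1):
        print(f"SIGKILL attempt {attempt} at SIGTERM + {t:g}s")
```

Under these assumptions, the defaults (5s, 8 attempts) put the SIGKILL attempts at 5, 15, 35, 75, 155, 315, 615, and 915 seconds after SIGTERM, so a node would be drained roughly 15 minutes after the job was first signaled.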

@grondo
Contributor

grondo commented Feb 5, 2025

For reference, PR #6101 added max-kill-count and failed to document it :-(

@garlick
Member Author

garlick commented Feb 5, 2025

The man page also doesn't mention that kill-timeout is a Flux Standard Duration (FSD) string, or that kill-signal and term-signal are signal names given as strings rather than signal numbers.
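
As a rough illustration of what those value types mean, the Python sketch below converts an FSD-style string to seconds and a signal name to a number. Both helpers (`parse_fsd`, `signal_number`) are hypothetical and are not part of Flux. The FSD grammar here is a simplified assumption (a number with an optional `ms`, `s`, `m`, `h`, or `d` suffix, defaulting to seconds), and the example signal names are assumed to exist on the host platform.

```python
import re
import signal

# Seconds per unit for the simplified FSD grammar assumed here.
_FSD_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_FSD_RE = re.compile(r"^(\d+(?:\.\d*)?|\.\d+)(ms|s|m|h|d)?$")


def parse_fsd(text):
    """Convert a duration string such as '5s' or '2m' to seconds."""
    match = _FSD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _FSD_UNITS[unit or "s"]


def signal_number(name):
    """Look up a signal by name (e.g. 'SIGTERM') and return its number."""
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {name!r}") from None


if __name__ == "__main__":
    print(parse_fsd("5s"), signal_number("SIGTERM"))
```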

I suppose it could also mention that SIGKILL is replaced with SIGUSR1 when the job is started by the IMP.

@grondo
Contributor

grondo commented Feb 5, 2025

> I suppose it could also mention that SIGKILL is replaced with SIGUSR1 when the job is started by the IMP.

That could be confusing because the net effect is that the job processes are sent SIGKILL?

@garlick
Member Author

garlick commented Feb 5, 2025

Yeah if we go there it will require a bit more explanation I guess.
