HCookie
Is your feature request related to a problem? Please describe.
Many Slurm-based systems will kill a training job before it can complete safely. It would be good to preempt this.
Describe the solution you'd like
A callback that takes a time limit and, when that limit is exceeded, saves a checkpoint, kills the job, and writes to a file.
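To make the request concrete, here is a minimal, framework-agnostic sketch of what such a callback might look like. Everything in it is an assumption for illustration: the class name `TimeLimitCallback`, the `on_step_end` hook, the `save_checkpoint` callable, and the toy `train` loop are all hypothetical and do not refer to any existing API in this project. "Killing the job" is modelled here as setting a stop flag that the loop checks; a real implementation might instead ask the trainer to stop or exit the process.

```python
import time
from pathlib import Path
from typing import Callable


class TimeLimitCallback:
    """Hypothetical sketch: stop training cleanly once a time budget is spent."""

    def __init__(
        self,
        limit_seconds: float,
        record_path: str,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit_seconds = limit_seconds
        self.record_path = Path(record_path)
        self.clock = clock            # injectable so the logic can be tested
        self.start = clock()          # time budget starts at construction
        self.should_stop = False

    def on_step_end(self, step: int, save_checkpoint: Callable[[], None]) -> None:
        """Call after every training step (the hook name is an assumption)."""
        elapsed = self.clock() - self.start
        if self.should_stop or elapsed < self.limit_seconds:
            return
        # Limit exceeded: checkpoint first, then record why we stopped.
        save_checkpoint()
        self.record_path.write_text(
            f"stopped at step {step} after {elapsed:.0f}s\n"
        )
        self.should_stop = True


def train(
    callback: TimeLimitCallback,
    max_steps: int,
    save_checkpoint: Callable[[], None],
) -> int:
    """Toy loop standing in for a real trainer; returns the steps completed."""
    step = 0
    while step < max_steps and not callback.should_stop:
        step += 1
        # ... one optimisation step would go here ...
        callback.on_step_end(step, save_checkpoint)
    return step
```

The limit would presumably be set somewhat below the Slurm wall-time so that the checkpoint has time to finish writing before the scheduler terminates the job.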
Describe alternatives you've considered
No response
Additional context
No response
Organisation
ECMWF