Add timelimit callback #114

Open
HCookie opened this issue Feb 4, 2025 · 0 comments · May be fixed by #115
Labels: enhancement, training

HCookie commented Feb 4, 2025

Is your feature request related to a problem? Please describe.

Many Slurm-based systems will kill a training job once it reaches its time limit, before it can safely complete. It would be good to pre-empt this.

Describe the solution you'd like

A callback with a time limit which, when exceeded, saves a checkpoint, kills the job, and writes to a file.
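
A rough sketch of the idea, to make the request concrete. This is not tied to any trainer API: `TimeLimit`, `check`, `save_checkpoint`, and `record_file` are all hypothetical names, and the real callback would presumably hook into the training framework's callback interface instead. The sketch also returns a stop flag rather than killing the process, on the assumption that the training loop can exit cleanly once told to.

```python
import time
from pathlib import Path
from typing import Callable


class TimeLimit:
    """Hypothetical time-limit guard (illustration only, not the project's API)."""

    def __init__(
        self,
        limit_seconds: float,
        save_checkpoint: Callable[[], None],
        record_file: str,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit_seconds = limit_seconds
        self.save_checkpoint = save_checkpoint  # supplied by the caller
        self.record_file = Path(record_file)
        self.clock = clock  # injectable so the guard can be tested
        self.start = clock()
        self.triggered = False

    def check(self) -> bool:
        """Return True once the limit is exceeded and training should stop."""
        if self.triggered:
            return True
        elapsed = self.clock() - self.start
        if elapsed < self.limit_seconds:
            return False
        # Limit exceeded: save state first, then record why the run stopped.
        self.save_checkpoint()
        self.record_file.write_text(
            f"Time limit of {self.limit_seconds} s reached after {elapsed:.0f} s\n"
        )
        self.triggered = True
        return True
```

In a training loop this would be called after each batch or epoch, for example `if guard.check(): break`, with the limit set somewhat below the Slurm time limit so the checkpoint has time to be written.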

Describe alternatives you've considered

No response

Additional context

No response

Organisation

ECMWF

HCookie added the enhancement and training labels on Feb 4, 2025
HCookie self-assigned this on Feb 4, 2025
HCookie linked a pull request on Feb 4, 2025 that will close this issue