Checkpoints #16

Open
LinasVidziunas opened this issue Jan 30, 2022 · 1 comment
Labels
enhancement: New feature or request
Priority 4: The lower priority, the more important it is for the success of our bachelor.

Comments

@LinasVidziunas
Owner

LinasVidziunas commented Jan 30, 2022

Checkpoints

It would probably be smart to incorporate checkpoints into our training. I currently have a process that has been running for 13 hours (probably overkill). Since we're going to use multiple views later in the project, we can expect training times to increase considerably. SLURM automatically terminates any process that runs for more than 24 hours; if no checkpoints have been made, all the work from those 24 hours is lost. The GPU clusters at UiS also seem to have power failures every few months, so it's good to be prepared.

TASKS:

  • Automate checkpoint creation every x epochs, where x is configurable.
  • Automate restoring from a checkpoint.
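
A minimal sketch of the first task, assuming the model is trained with Keras `model.fit`. The names `should_checkpoint`, `checkpoint_path`, and `make_periodic_checkpoint`, and the `ckpt-NNNN.keras` file naming, are hypothetical and only for illustration. The `.keras` format assumes a recent Keras/TensorFlow version; older versions may need `.h5` or a SavedModel directory instead.

```python
import os


def should_checkpoint(epoch_index, every_n_epochs):
    """Return True when a checkpoint is due after the given 0-based epoch."""
    return (epoch_index + 1) % every_n_epochs == 0


def checkpoint_path(checkpoint_dir, epoch_index):
    """Build a file name that records the 1-based epoch, e.g. ckpt-0005.keras."""
    return os.path.join(checkpoint_dir, f"ckpt-{epoch_index + 1:04d}.keras")


def make_periodic_checkpoint(checkpoint_dir, every_n_epochs):
    """Create a Keras callback that saves the full model every N epochs."""
    # Imported lazily so the two helpers above can be used without TensorFlow.
    from tensorflow import keras

    class PeriodicCheckpoint(keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Keras passes a 0-based epoch index here.
            if should_checkpoint(epoch, every_n_epochs):
                os.makedirs(checkpoint_dir, exist_ok=True)
                self.model.save(checkpoint_path(checkpoint_dir, epoch))

    return PeriodicCheckpoint()
```

Usage could look like `model.fit(x, x, epochs=100, callbacks=[make_periodic_checkpoint("./ckpt", every_n_epochs=5)])`. The built-in `keras.callbacks.ModelCheckpoint` from resource 1 is an alternative, but note that an integer `save_freq` there counts batches, not epochs.
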

I'd especially recommend looking at resource 2, as it's simple and is used together with distributed training (or at the other resources that also cover distributed training).
When implementing this, expect distributed training (multiple GPUs) to be added in the near future.
It might also be smart to keep this code out of the cAE.py file, leaving cAE.py to contain (more or less) only the cAE model. Later, cAE.py might be renamed to models.py and include other models such as a VAE.
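
Restoring with distributed training in mind could be sketched roughly as follows, loosely following the pattern in resource 2. This is an assumption-laden sketch, not a finished implementation: it assumes checkpoints are full models saved as `ckpt-<epoch>.keras` in one directory, and `latest_checkpoint`, `make_or_restore_model`, and the `build_model` argument are hypothetical names. The code could live in its own module rather than in cAE.py.

```python
import os
import re

# Assumed file naming: ckpt-0005.keras means "saved after epoch 5".
_CKPT_PATTERN = re.compile(r"^ckpt-(\d+)\.keras$")


def latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) of the newest checkpoint, or (None, 0) if none exists."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    best_path, best_epoch = None, 0
    for name in os.listdir(checkpoint_dir):
        match = _CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path, best_epoch


def make_or_restore_model(build_model, checkpoint_dir):
    """Load the newest checkpoint if there is one, else build a fresh model.

    build_model is a caller-supplied function returning a compiled Keras model.
    """
    # Imported lazily so latest_checkpoint can be used without TensorFlow.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    path, epoch = latest_checkpoint(checkpoint_dir)
    # Variables must be created inside the strategy scope to be mirrored.
    with strategy.scope():
        model = tf.keras.models.load_model(path) if path else build_model()
    return model, epoch
```

The returned epoch can be passed to `model.fit(..., initial_epoch=epoch, epochs=total_epochs)` so that a job restarted after a SLURM timeout continues where it stopped instead of starting from epoch 0.
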

Resources

  1. https://keras.io/api/callbacks/model_checkpoint/
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.tensorflow.org/tutorials/distribute/keras#define_the_callbacks
  4. https://www.tensorflow.org/tutorials/distribute/custom_training#training_loop
LinasVidziunas added the enhancement and Priority 3 labels on Jan 30, 2022
LinasVidziunas added the Priority 4 label and removed the Priority 3 label on Feb 7, 2022
@LinasVidziunas
Owner Author

Status update as of 07/02/22 (Feb 7, 2022)

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of significant importance.
