
Multiple GPUs #15

Open
LinasVidziunas opened this issue Jan 30, 2022 · 3 comments


Multiple GPUs

Currently, even though multiple GPUs are assigned to the project, we don't see any speed-up in training.
In fact, the time per epoch gets worse by ~1 second.
This might become more important once we incorporate multiple views and/or more complex models, as the training time is expected to increase.

Suggestion

Implement what's called synchronous data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and the results are then merged [2].
Implementing this is not as easy as it seems at first glance, because our datasets (x_train, x_test) have to be turned into distributed datasets built on tf.data.Dataset objects, as mentioned: "Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow." [2] Without tf.data.Dataset objects, a lot of warnings get displayed before training starts, and the training times seem to become way worse (very bad)! More information about creating distributed datasets can be found in the TF documentation chapter Custom training with tf.distribute.Strategy [4], under the section "Setup input pipeline".
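
A minimal sketch of what this could look like with tf.distribute.MirroredStrategy. Here x_train/x_test are the arrays we already have, while build_cae() and the batch size of 32 per replica are just placeholders/assumptions, not something from our code:

```python
import tensorflow as tf

# Synchronous data parallelism: the model is replicated on every visible GPU
# and each replica processes a different slice of every global batch.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

BATCH_SIZE_PER_REPLICA = 32  # assumed value, tune for our models
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Wrap the existing arrays in tf.data.Dataset objects so TF can shard the
# batches across replicas (input == target for the autoencoder).
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, x_train))
            .shuffle(len(x_train))
            .batch(GLOBAL_BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))
test_ds = (tf.data.Dataset.from_tensor_slices((x_test, x_test))
           .batch(GLOBAL_BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))

# Model creation and compilation must happen inside the strategy scope.
with strategy.scope():
    model = build_cae()  # hypothetical factory returning a compiled Keras cAE

model.fit(train_ds, validation_data=test_ds, epochs=20)
```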

Quick Recap:

  • Synchronous data parallelism
  • tf.data.Dataset objects

Furthermore

  • It would be nice to split the models (cAE and vAE) out of the rest of the code once synchronous data parallelism is in place, to reduce the number of lines per file and keep the source code organized.
  • Use variables such as BATCH_SIZE_PER_REPLICA and GLOBAL_BATCH_SIZE to avoid hard-coding the number of GPUs in use, so that we can run on 1 or 8 GPUs without changing any of the Python code (see the sketch after this list). I'm currently not sure whether BATCH_SIZE_PER_REPLICA applies to our case, but anything that depends on the number of GPUs should be automated.
  • Expect checkpoints to come soon, and make the code modular enough to integrate them easily.
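
A sketch of the batch-size bookkeeping from the second bullet, following the variable names in the TF tutorial [4]; the per-replica batch size of 32 is just an assumed value:

```python
import tensorflow as tf

# Everything that depends on the number of GPUs is derived at runtime,
# so the same script runs unchanged on 1 or 8 GPUs.
gpus = tf.config.list_physical_devices("GPU")
strategy = tf.distribute.MirroredStrategy()

BATCH_SIZE_PER_REPLICA = 32  # assumed value
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print(f"Detected {len(gpus)} GPU(s); "
      f"{strategy.num_replicas_in_sync} replica(s); "
      f"global batch size {GLOBAL_BATCH_SIZE}")
```

Datasets would then be batched with GLOBAL_BATCH_SIZE, so each replica still sees BATCH_SIZE_PER_REPLICA samples per step.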

Resources

  1. https://www.tensorflow.org/tutorials/distribute/keras#define_the_distribution_strategy
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.coursera.org/lecture/custom-distributed-training-with-tensorflow/custom-training-for-multiple-gpu-mirrored-strategy-EDiRd
  4. https://www.tensorflow.org/tutorials/distribute/custom_training
@LinasVidziunas added the enhancement (New feature or request) label Jan 30, 2022
@LinasVidziunas self-assigned this Jan 30, 2022

LinasVidziunas commented Jan 31, 2022

Status report as of 31/01/22

  • Synchronous data parallelism
  • tf.data.Dataset objects
  • Code organization
  • Variables such as BATCH_SIZE_PER_REPLICA
  • Making ready for checkpoints (see the callback sketch after this list)
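
For the checkpoint item above (see resource [2] on using callbacks for fault tolerance), a minimal sketch with a standard ModelCheckpoint callback; the checkpoints/ directory and the model/train_ds names are placeholders, not from our code:

```python
import os
import tensorflow as tf

checkpoint_dir = "./checkpoints"  # placeholder path
os.makedirs(checkpoint_dir, exist_ok=True)

# Save the weights after every epoch so a crashed run can be resumed from
# the most recent file in checkpoint_dir.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "epoch-{epoch:02d}.weights.h5"),
    save_weights_only=True,
)

# model.fit(train_ds, epochs=200, callbacks=[checkpoint_cb])
```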

New from this status report

  • Disable auto-shard in Model.predict()
  • Model.predict() returns expected results even though it produces warnings about auto-shard (EDITED)

Quantitative data

Screenshots on Messenger show the epoch times for 1 vs. 3 GPUs over 20 epochs using the cAE on the main branch.
3 GPUs were approximately 1.8 times faster than 1 GPU for the whole process from start to finish. Per-epoch times were reduced from 22 s to 9 s, i.e. about 2.4 times faster. With more epochs or larger networks, I would expect the whole-process time with 3 GPUs to get closer to 2.4 times faster than with one GPU.

Problems

Calling Model.predict() in cAE.py produces a warning telling us that auto-sharding should either be disabled or set to AutoShardPolicy.DATA. This is already done on the dataset! I suspect the warnings come from the internals of Model.predict(), which could be overridden, but that seems like hell to do! Preferably the warnings should be silenced, but if Model.predict() works as expected it might not be worth spending time rewriting the Model.predict() function.
I'm currently running a test with 200 epochs to check whether Model.predict() returns the expected results even though the warnings are thrown.
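
For reference, a sketch of the setting the warning asks for; this is essentially what is already applied to our dataset (the dataset here is just a placeholder built from zeros):

```python
import tensorflow as tf

# The warning asks for auto-sharding to be disabled (OFF) or switched from
# sharding by files to sharding by data (DATA).
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA  # or AutoShardPolicy.OFF to disable
)

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 4])).batch(2)
dataset = dataset.with_options(options)
```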

EDIT 31/01/22 12:05

After running 200 epochs, Model.predict() seems to return the expected results. I will call this a success, but keep in mind that this approach might be prone to errors later on. The checkboxes above have been edited accordingly.


LinasVidziunas commented Jan 31, 2022

Status update as of 07/02/22

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of significant importance.


Reserved for status report.

@LinasVidziunas added the Priority 3 (The lower priority, the more important it is for the success of our bachelor.) label Jan 31, 2022
@LinasVidziunas added the Priority 4 label and removed the Priority 3 label Feb 7, 2022