Multiple GPUs #15
Status report as of 31/01/22
New from this status report
Quantitative data: screenshots on Messenger show 1 vs 3 GPU epoch times for 20 epochs using the cAE on the main branch. Problems: on calling … EDIT 31/01/22 12:05: after running 200 epochs …
Status update as of 07/02/22: Changed priority of this issue: …
Reserved for status report.
Multiple GPUs
Currently, even though multiple GPUs are assigned to the project, we don't see any improvement in training time.
In fact, the time per epoch gets worse by roughly one second.
This might become more important once we incorporate multiple views and/or more complex models, as the training time should increase.
Suggestion
Implement what is called synchronous data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and their results are then merged [2].
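A minimal sketch of how this could look with tf.distribute.MirroredStrategy and Keras; the model builder (build_cae) and its shapes are placeholders here, not the project's actual cAE:

```python
import tensorflow as tf

# One replica is created per visible GPU; on a single-GPU or CPU-only machine
# this still runs, just with one replica.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

def build_cae(input_shape=(64, 64, 1)):
    # Placeholder convolutional autoencoder standing in for the cAE on the main branch.
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    outputs = tf.keras.layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
    return tf.keras.Model(inputs, outputs)

# Model creation and compilation must happen inside the strategy scope so that
# the variables are mirrored across all replicas.
with strategy.scope():
    model = build_cae()
    model.compile(optimizer="adam", loss="mse")

# During training, each replica processes a different slice of every global batch,
# and the gradients are reduced across replicas before the variables are updated.
```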
Implementing this is not as easy as it seems at first glance, since our datasets (x_train, x_test) have to be turned into distributed datasets based on tf.data.Dataset objects, as mentioned in the guide: "Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow." [2] Without tf.data.Dataset objects, a lot of warnings are displayed before training starts, and the training times become considerably worse. More information about creating distributed datasets can be found in the TF documentation chapter Custom training with tf.distribute.Strategy [4], under the section "Setup input pipeline".
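A sketch of that conversion, assuming x_train / x_test are in-memory NumPy arrays as named in this issue; the shapes and batch sizes are illustrative:

```python
import numpy as np
import tensorflow as tf

# Stand-in data with made-up shapes; replace with the project's real arrays.
x_train = np.random.rand(1024, 64, 64, 1).astype("float32")
x_test = np.random.rand(256, 64, 64, 1).astype("float32")

strategy = tf.distribute.MirroredStrategy()

# Scale the global batch size by the number of replicas so each GPU still
# receives a full per-replica batch.
per_replica_batch = 32
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# For an autoencoder the inputs and targets are the same arrays.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, x_train))
            .shuffle(len(x_train))
            .batch(global_batch)
            .prefetch(tf.data.AUTOTUNE))
test_ds = (tf.data.Dataset.from_tensor_slices((x_test, x_test))
           .batch(global_batch)
           .prefetch(tf.data.AUTOTUNE))

# With model.fit these datasets can be passed directly; a custom training loop
# would first wrap them with strategy.experimental_distribute_dataset(train_ds).
```

Scaling the global batch size by strategy.num_replicas_in_sync keeps the per-GPU batch size constant, which makes 1 vs 3 GPU epoch-time comparisons more meaningful.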
Quick Recap:
Furthermore
Resources