
Multiple GPUs #15

Open
LinasVidziunas opened this issue Jan 30, 2022 · 3 comments


Multiple GPUs

Currently, even though multiple GPUs are assigned to the project, we don't see any speed-up in training.
In fact, the time per epoch gets worse by ~1 second.
This might become more important once we incorporate multiple views and/or more complex models, as the training time is expected to increase.

Suggestion

Implement what's called synchronous data parallelism, where a single model gets replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and the results are then merged [2].
Implementing this is not as easy as it seems at first glance, because our datasets (x_train, x_test) have to be turned into distributed datasets built on tf.data.Dataset objects, as mentioned: "Importantly, we recommend that you use tf.data.Dataset objects to load data in a multi-device or distributed workflow." [2] Without tf.data.Dataset objects, a lot of warnings get displayed before training starts, and the training times seem to become way worse (very bad)! More information about creating distributed datasets can be found in the TF documentation chapter Custom training with tf.distribute.Strategy [4], under the section "Setup input pipeline".
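
A minimal sketch of what this could look like with tf.distribute.MirroredStrategy. Here x_train/x_test are the arrays we already have, while build_cae() and the batch size of 32 per replica are just placeholders/assumptions, not something from our code:

```python
import tensorflow as tf

# Synchronous data parallelism: the model is replicated on every visible GPU
# and each replica processes a different slice of every global batch.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

BATCH_SIZE_PER_REPLICA = 32  # assumed value, tune for our models
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Wrap the existing arrays in tf.data.Dataset objects so TF can shard the
# batches across replicas (input == target for the autoencoder).
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, x_train))
            .shuffle(len(x_train))
            .batch(GLOBAL_BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))
test_ds = (tf.data.Dataset.from_tensor_slices((x_test, x_test))
           .batch(GLOBAL_BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))

# Model creation and compilation must happen inside the strategy scope.
with strategy.scope():
    model = build_cae()  # hypothetical factory returning a compiled Keras cAE

model.fit(train_ds, validation_data=test_ds, epochs=20)
```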

Quick Recap:

  • Synchronous data parallelism
  • tf.data.Dataset objects

Furthermore

  • It would be nice to split the models (cAE and vAE) out of the rest of the code once synchronous data parallelism is in place, to reduce the number of lines per file and keep the source code organized.
  • Use variables such as BATCH_SIZE_PER_REPLICA and GLOBAL_BATCH_SIZE to avoid hard-coding the number of GPUs in use, so that we can run on 1 or 8 GPUs without changing any of the Python code (see the sketch after this list). I'm currently not sure whether BATCH_SIZE_PER_REPLICA applies to our case, but anything that depends on the number of GPUs should be automated.
  • Expect checkpoints to come soon, and make the code modular enough to integrate them easily.
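
A sketch of the batch-size bookkeeping from the second bullet, following the variable names in the TF tutorial [4]; the per-replica batch size of 32 is just an assumed value:

```python
import tensorflow as tf

# Everything that depends on the number of GPUs is derived at runtime,
# so the same script runs unchanged on 1 or 8 GPUs.
gpus = tf.config.list_physical_devices("GPU")
strategy = tf.distribute.MirroredStrategy()

BATCH_SIZE_PER_REPLICA = 32  # assumed value
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print(f"Detected {len(gpus)} GPU(s); "
      f"{strategy.num_replicas_in_sync} replica(s); "
      f"global batch size {GLOBAL_BATCH_SIZE}")
```

Datasets would then be batched with GLOBAL_BATCH_SIZE, so each replica still sees BATCH_SIZE_PER_REPLICA samples per step.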

Resources

  1. https://www.tensorflow.org/tutorials/distribute/keras#define_the_distribution_strategy
  2. https://keras.io/guides/distributed_training/#using-callbacks-to-ensure-fault-tolerance
  3. https://www.coursera.org/lecture/custom-distributed-training-with-tensorflow/custom-training-for-multiple-gpu-mirrored-strategy-EDiRd
  4. https://www.tensorflow.org/tutorials/distribute/custom_training
@LinasVidziunas added the enhancement (New feature or request) label Jan 30, 2022
@LinasVidziunas self-assigned this Jan 30, 2022

LinasVidziunas commented Jan 31, 2022

Status report as of 31/01/22

  • Synchronous data parallelism
  • tf.data.Dataset objects
  • Code organization
  • Variables such as BATCH_SIZE_PER_REPLICA
  • Making ready for checkpoints (see the callback sketch after this list)
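
For the checkpoint item above (see resource [2] on using callbacks for fault tolerance), a minimal sketch with a standard ModelCheckpoint callback; the checkpoints/ directory and the model/train_ds names are placeholders, not from our code:

```python
import os
import tensorflow as tf

checkpoint_dir = "./checkpoints"  # placeholder path
os.makedirs(checkpoint_dir, exist_ok=True)

# Save the weights after every epoch so a crashed run can be resumed from
# the most recent file in checkpoint_dir.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "epoch-{epoch:02d}.weights.h5"),
    save_weights_only=True,
)

# model.fit(train_ds, epochs=200, callbacks=[checkpoint_cb])
```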

New from this status report

  • Disable auto-shard in Model.predict()
  • Model.predict() returns expected results even though it produces warnings about auto-shard (EDITED)

Quantitative data

Screenshots on Messenger show the epoch times for 1 vs. 3 GPUs over 20 epochs using the cAE on the main branch.
3 GPUs were approximately 1.8 times faster than 1 GPU for the whole process from start to finish. Per-epoch times were reduced from 22 s to 9 s, i.e. about 2.4 times faster. With more epochs or larger networks, I would expect the whole-process time with 3 GPUs to get closer to 2.4 times faster than with one GPU.

Problems

Calling Model.predict() in cAE.py produces a warning telling us that auto-sharding should either be disabled or set to AutoShardPolicy.DATA. This is already done on the dataset! I suspect the warnings come from the internals of Model.predict(), which could be overridden, but that seems like hell to do! Preferably the warnings should be silenced, but if Model.predict() works as expected it might not be worth spending time rewriting the Model.predict() function.
I'm currently running a test with 200 epochs to check whether Model.predict() returns the expected results even though the warnings are thrown.
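
For reference, a sketch of the setting the warning asks for; this is essentially what is already applied to our dataset (the dataset here is just a placeholder built from zeros):

```python
import tensorflow as tf

# The warning asks for auto-sharding to be disabled (OFF) or switched from
# sharding by files to sharding by data (DATA).
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA  # or AutoShardPolicy.OFF to disable
)

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 4])).batch(2)
dataset = dataset.with_options(options)
```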

EDIT 31/01/22 12:05

After running 200 epochs, Model.predict() seems to return the expected results. I will call this a success, but keep in mind that this approach might be prone to errors later on. The checkboxes above have been edited accordingly.


LinasVidziunas commented Jan 31, 2022

Status update as of 07/02/22

Changed the priority of this issue from Priority 3 to Priority 4, as it is currently not of significant importance.


Reserved for status report.

@LinasVidziunas added the Priority 3 (The lower priority, the more important it is for the success of our bachelor.) label Jan 31, 2022
@LinasVidziunas added the Priority 4 label and removed the Priority 3 label Feb 7, 2022