

workingloong edited this page Jul 23, 2020 · 15 revisions

AllReduce Benchmark

Minikube

Batch size: 64

Number of batches per task: 50

Data set: CIFAR-10, image size (32, 32, 3)

Worker resource: cpu=0.3,memory=2048Mi,ephemeral-storage=1024Mi

ResNet50

ResNet50 is a computation-intensive model. For CIFAR-10 it has 23,555,082 trainable parameters.

| Workers | Computation : communication | Speed | Speed-up ratio |
| --- | --- | --- | --- |
| 1 | - | 3.1 images/s | 1 |
| 2 | 10 : 1 | 5.65 images/s | 1.82 |

1 worker (local):

Time profiling of one task on a worker:

[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds

"report_gradient time" is the time to average gradients by AllReduce.
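
The timing lines above follow a fixed "name time is N seconds" pattern, so they can be pulled into a dictionary for the calculations on this page. The following is a minimal sketch; `TIMING_RE` and `parse_timings` are hypothetical helper names, not part of `timing_utils.py`.

```python
import re

# Matches the tail of a report_timing line: "<name> time is <seconds> seconds".
TIMING_RE = re.compile(r"(\w+) time is ([0-9.eE+-]+) seconds")


def parse_timings(log_text):
    """Return {timer name: seconds} for every timing line found in log_text."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


# The four lines reported for the single-worker ResNet50 run.
LOG = """\
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds
"""

timings = parse_timings(LOG)
print(timings["task_process"], timings["report_gradient"])
```

With a single local worker there are no peers to average with, which is consistent with the near-zero `report_gradient` value in this run.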

Speed: 64 * 50 / 1033 = 3.1 images/s

2 workers:

Time profiling of one task on a worker:

[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1131.8 seconds
[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1127.86 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000961542 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 101.069 seconds

computation : communication = 10 : 1
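
One way to arrive at this ratio from the log is sketched below. It assumes that the `report_gradient` time is included in the `batch_process` time, so computation is the difference between the two. That reading is an assumption, although it reproduces the ratios reported on this page. The function name is hypothetical.

```python
def comm_per_10_compute(batch_process_s, report_gradient_s):
    """Return x such that computation : communication = 10 : x.

    Assumption: report_gradient time is part of batch_process time,
    so computation time is batch_process minus report_gradient.
    """
    computation_s = batch_process_s - report_gradient_s
    return 10 * report_gradient_s / computation_s


# ResNet50 with 2 workers, values taken from the log lines above.
x = comm_per_10_compute(1127.86, 101.069)
print(f"computation : communication = 10 : {x:.1f}")
```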

Speed: 2 * 64 * 50 / 1131.8 = 5.65 images/s

Speed-up ratio: 5.65 / 3.1 = 1.82
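
The speed and speed-up arithmetic used throughout this page can be written as two small functions. This is a sketch of the formulas as they appear here, assuming every worker processes one task (50 batches of 64 images) in parallel during the measured task time. The function names are illustrative.

```python
def images_per_second(workers, batch_size, batches_per_task, task_seconds):
    """Aggregate throughput when each worker finishes one task in task_seconds."""
    return workers * batch_size * batches_per_task / task_seconds


def speed_up(speed, baseline_speed):
    """Throughput relative to the single-worker baseline."""
    return speed / baseline_speed


one_worker = images_per_second(1, 64, 50, 1033.2)
two_workers = images_per_second(2, 64, 50, 1131.8)
print(f"{one_worker:.1f} images/s, {two_workers:.2f} images/s")
print(f"speed-up from rounded speeds: {speed_up(5.65, 3.1):.2f}")
```

The 1.82 figure comes from dividing the rounded speeds. Dividing the unrounded speeds gives a marginally higher value, about 1.83.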

MobileNetV2

MobileNetV2 is a communication-intensive model. For CIFAR-10 it has 2,236,682 trainable parameters.
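
To put the parameter counts in terms of data exchanged, the sketch below estimates the gradient payload per AllReduce step for both models. It assumes one float32 value (4 bytes) per trainable parameter. The page does not state the gradient dtype, so treat the results as rough estimates.

```python
BYTES_PER_PARAM = 4  # assumption: gradients are stored as float32


def gradient_mib(trainable_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size in MiB of one full set of gradients."""
    return trainable_params * bytes_per_param / 2**20


resnet50_mib = gradient_mib(23_555_082)
mobilenet_v2_mib = gradient_mib(2_236_682)
print(f"ResNet50: {resnet50_mib:.1f} MiB, MobileNetV2: {mobilenet_v2_mib:.1f} MiB")
```

Payload size alone does not decide whether a model is communication-bound. What matters is the communication time relative to the per-batch computation time, which is what the timing logs on this page measure.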

| Workers | Computation : communication | Speed | Speed-up ratio |
| --- | --- | --- | --- |
| 1 | - | 29 images/s | 1 |
| 2 | 10 : 3 | 44.7 images/s | 1.54 |
| 3 | 10 : 6 | 57.2 images/s | 1.97 |

1 worker (local):

Time profiling of one task on a worker:

[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 110.909 seconds
[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 110.655 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.00154781 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000884056 seconds

Speed: 64 * 50 / 110.9 = 29 images/s

2 workers:

Time profiling of one task on a worker:

[2020-07-22 07:11:15,382] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 143.789 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 143.148 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000886917 seconds
[2020-07-22 07:11:15,384] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 33.6484 seconds

computation : communication = 10 : 3

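Speed: 2 * 64 * 50 / 143.1 = 44.7 images/s (143.1 s is the batch_process time; with the task_process time of 143.8 s the result is 44.5 images/s)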

Speed-up ratio: 44.7 / 29 = 1.54

3 workers:

Time profiling of one task on a worker:

[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 167.943 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 167.395 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000771523 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 60.0891 seconds

computation : communication = 10 : 6

Speed: 3 * 64 * 50 / 167.9 = 57.2 images/s

Speed-up ratio: 57.2 / 29 = 1.97

ASI

Worker resource: cpu=4,memory=8192Mi,ephemeral-storage=1024Mi

MobileNetV2

| Workers | Communication | Speed | Speed-up ratio |
| --- | --- | --- | --- |
| 1 | - | 353.6 images/s | 1 |
| 2 | 24% | 503 images/s | 1.42 |
| 4 | 44.7% | 680 images/s | 1.92 |
| 8 | 66.7% | 648 images/s | 1.83 |
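
The speed-up column for MobileNetV2 on ASI can be recomputed from the speed column, and dividing by the worker count shows how much of each added worker turns into throughput. The sketch below does both. The "per-worker efficiency" figure is an illustrative metric added here, not one reported by the benchmark.

```python
# (workers, images/s) rows copied from the ASI MobileNetV2 results.
ROWS = [(1, 353.6), (2, 503.0), (4, 680.0), (8, 648.0)]


def scaling(rows):
    """Return (workers, speed-up ratio, per-worker efficiency) for each row.

    The first row is taken as the single-worker baseline.
    """
    baseline = rows[0][1]
    result = []
    for workers, speed in rows:
        ratio = speed / baseline
        result.append((workers, round(ratio, 2), round(ratio / workers, 2)))
    return result


for workers, ratio, efficiency in scaling(ROWS):
    print(f"{workers} workers: speed-up {ratio}, per-worker efficiency {efficiency}")
```

In these numbers, throughput falls between 4 and 8 workers (680 to 648 images/s) while the communication share rises from 44.7% to 66.7%.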

ResNet50

| Workers | Communication | Speed | Speed-up ratio |
| --- | --- | --- | --- |
| 1 | - | 26.7 images/s | 1 |
| 2 | 18% | 41 images/s | 1.57 |
| 4 | 25% | 68.4 images/s | 2.56 |
| 8 | 32% | 123 images/s | 4.61 |