allreduce benchmark

Qinlong Wang edited this page Jul 28, 2020 · 15 revisions

AllReduce Benchmark

Minikube

Batch size: 64

Number of batches per task: 50

Data set: cifar10, image size (32, 32, 3)

Worker resource: cpu=0.3,memory=2048Mi,ephemeral-storage=1024Mi

Resnet50

Resnet50 is a computation-intensive model. For cifar10 it has 23,555,082 trainable parameters.

| Workers | computation : communication | Speed | Speedup Ratio |
| --- | --- | --- | --- |
| 1 | 0% | 3.1 images/s | 1 |
| 2 | 10 : 1 | 5.65 images/s | 1.82 |

1 worker (local):

Time profiling of one task on a worker:

[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds

"report_gradient time" is the time to average gradients by AllReduce.

Speed: 64 * 50 / 1033 = 3.1 images/s
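The report_timing lines above all share one layout, so they can be turned into numbers with a small parser. The sketch below is a hypothetical helper, not part of the benchmark code: the name `parse_timings` and the regular expression are assumptions based only on the log lines shown on this page.

```python
import re

# Assumed line layout, taken from the excerpts on this page:
# [timestamp] [DEBUG] [timing_utils.py:54:report_timing] <name> time is <seconds> seconds
TIMING_RE = re.compile(r"report_timing\]\s+(\w+) time is ([0-9.eE+-]+) seconds")


def parse_timings(log_text):
    """Return {timer_name: seconds} for every report_timing line in log_text."""
    return {m.group(1): float(m.group(2)) for m in TIMING_RE.finditer(log_text)}


SAMPLE_LOG = """\
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds
"""

timings = parse_timings(SAMPLE_LOG)
for name, seconds in timings.items():
    print(f"{name}: {seconds} s")
```

If a timer is reported more than once in the same text, this sketch keeps only the last value.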

2 workers:

Time profiling of one task on a worker:

[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1131.8 seconds
[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1127.86 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000961542 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 101.069 seconds

computation : communication = 10 : 1
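The ratio above can be reproduced from the two log values. This is a minimal sketch that assumes batch_process time includes report_gradient time, so computation is taken as the difference; that assumption is inferred from the numbers on this page, not confirmed by the benchmark code. The function name is hypothetical.

```python
def comp_comm_ratio(batch_process_s, report_gradient_s):
    """Computation time divided by communication time.

    Assumption: batch_process time already contains report_gradient time,
    so pure computation is the difference between the two.
    """
    computation_s = batch_process_s - report_gradient_s
    return computation_s / report_gradient_s


# Values from the 2-worker Resnet50 log above.
ratio = comp_comm_ratio(1127.86, 101.069)
print(f"computation : communication = {ratio:.1f} : 1")  # 10.2 : 1
```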

Speed: 2 * 64 * 50 / 1131 = 5.65 images/s

Speed-up ratio: 5.65 / 3.1 = 1.82
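The speed and speed-up arithmetic above can be written as two small steps. This is an illustrative sketch with a hypothetical function name; it rounds the speeds before dividing, as the page does.

```python
def images_per_second(workers, batch_size, batches_per_task, task_seconds):
    """Aggregate throughput when each worker finishes one task in task_seconds."""
    return workers * batch_size * batches_per_task / task_seconds


# task_process times from the Resnet50 logs above.
speed_1 = round(images_per_second(1, 64, 50, 1033.2), 2)  # 3.1
speed_2 = round(images_per_second(2, 64, 50, 1131.8), 2)  # 5.65

# Dividing the rounded speeds gives 1.82; the unrounded speeds give about 1.83.
speedup = round(speed_2 / speed_1, 2)
print(speed_1, speed_2, speedup)
```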

MobileNetV2

MobileNetV2 is a communication-intensive model. Its number of trainable parameters is 2,236,682.
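To put the two parameter counts in perspective, the sketch below estimates how many bytes of gradients each model contributes per AllReduce step. It assumes one float32 value (4 bytes) per trainable parameter and ignores any compression or protocol overhead; neither assumption is stated on this page.

```python
BYTES_PER_FLOAT32 = 4  # assumption: gradients are exchanged as float32

# Trainable parameter counts quoted on this page.
TRAINABLE_PARAMS = {"Resnet50": 23_555_082, "MobileNetV2": 2_236_682}


def gradient_payload_mib(num_params, bytes_per_value=BYTES_PER_FLOAT32):
    """Size of one full set of gradients in MiB."""
    return num_params * bytes_per_value / 2**20


for model, count in TRAINABLE_PARAMS.items():
    print(f"{model}: {gradient_payload_mib(count):.2f} MiB per step")
```

Under these assumptions Resnet50 exchanges roughly ten times more data per step than MobileNetV2. "Communication-intensive" therefore describes the share of communication relative to computation (10 : 3 for MobileNetV2 versus 10 : 1 for Resnet50 with 2 workers in this section), not the absolute payload.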

| Workers | computation : communication | Speed | Speedup Ratio |
| --- | --- | --- | --- |
| 1 | - | 29 images/s | 1 |
| 2 | 10 : 3 | 44.7 images/s | 1.54 |
| 3 | 10 : 6 | 57.2 images/s | 1.97 |

1 worker (local):

Time profiling of one task on a worker:

[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 110.909 seconds
[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 110.655 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.00154781 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000884056 seconds

Speed: 64 * 50 / 110.9 = 29 images/s

2 workers:

Time profiling of one task on a worker:

[2020-07-22 07:11:15,382] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 143.789 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 143.148 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000886917 seconds
[2020-07-22 07:11:15,384] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 33.6484 seconds

computation : communication = 10 : 3

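Speed: 2 * 64 * 50 / 143.1 = 44.7 images/s (this uses the batch_process time of 143.148 s; dividing by the task_process time of 143.8 s gives 44.5 images/s)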

Speed-up ratio: 44.7 / 29 = 1.54

3 workers:

Time profiling of one task on a worker:

[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 167.943 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 167.395 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000771523 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 60.0891 seconds

computation : communication = 10 : 6

Speed: 3 * 64 * 50 / 167.9 = 57.2 images/s

Speed-up ratio: 57.2 / 29 = 1.97

ASI

CPU only

Worker resource: cpu=4,memory=8192Mi,ephemeral-storage=1024Mi

MobileNetV2

| Workers | communication | Speed | Speedup Ratio |
| --- | --- | --- | --- |
| 1 | 0% | 353.6 images/s | 1 |
| 2 | 24% | 503 images/s | 1.42 |
| 4 | 44.7% | 680 images/s | 1.92 |
| 8 | 66.7% | 648 images/s | 1.83 |

Resnet50

| Workers | communication | Speed | Speedup Ratio |
| --- | --- | --- | --- |
| 1 | 0% | 26.7 images/s | 1 |
| 2 | 18% | 41 images/s | 1.57 |
| 4 | 25% | 68.4 images/s | 2.56 |
| 8 | 32% | 123 images/s | 4.61 |
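One way to compare how the two models scale on ASI is to divide each speed-up by the number of workers (often called parallel efficiency). The sketch below recomputes speed-up and efficiency from the speeds in the two CPU tables above; the efficiency figure is a derived quantity that the page itself does not report, and the names are hypothetical.

```python
# Speeds in images/s, copied from the two ASI CPU tables above.
ASI_CPU_SPEEDS = {
    "MobileNetV2": {1: 353.6, 2: 503.0, 4: 680.0, 8: 648.0},
    "Resnet50": {1: 26.7, 2: 41.0, 4: 68.4, 8: 123.0},
}


def scaling_table(speeds):
    """Map worker count -> (speedup vs. 1 worker, speedup per worker)."""
    base = speeds[1]
    return {w: (s / base, s / base / w) for w, s in sorted(speeds.items())}


for model, speeds in ASI_CPU_SPEEDS.items():
    for workers, (speedup, efficiency) in scaling_table(speeds).items():
        print(f"{model:12s} workers={workers} "
              f"speedup={speedup:.2f} efficiency={efficiency:.0%}")
```

For the 2-worker Resnet50 row the listed speeds give 41 / 26.7 ≈ 1.54, slightly below the tabulated 1.57, which may reflect rounding of the reported speed.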

GPU

Data: ImageNet, image shape (256, 256, 3)

Mini-batch size: 64

Each task consists of 16 mini-batches.

MobileNetV2

1024 images/task

| Workers | total task time | allreduce time | tensor.numpy() time | apply_gradients | forward + backward |
| --- | --- | --- | --- | --- | --- |
| 1 (local) | 6.06s | - | - | 5.59s | 0.47s |
| 2 | 8.34s | 7.25026s | 5.79s | 0.6s | 0.49s |
| 4 | 10.2029s | 8.9s | 5.78s | 0.71s | 0.49s |

Resnet50

| Workers | total task time | allreduce time | tensor.numpy() time | apply_gradients |
| --- | --- | --- | --- | --- |
| 1 (local) | 6.1s | - | - | 4.16s |
| 2 | 13.76s | 10.36s | 5.04s | 1.35s |
| 4 | 18s | 14.67s | 5.14s | 1.30s |
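The GPU tables can be summarized as the share of each task spent in allreduce. The sketch below only divides the two columns reported above; the function name is hypothetical, and whether the allreduce column already includes the tensor.numpy() time is not stated on this page.

```python
# (workers, total task time in s, allreduce time in s) from the GPU tables above.
GPU_TIMINGS = {
    "MobileNetV2": [(2, 8.34, 7.25026), (4, 10.2029, 8.9)],
    "Resnet50": [(2, 13.76, 10.36), (4, 18.0, 14.67)],
}


def allreduce_share(total_s, allreduce_s):
    """Fraction of the total task time spent in allreduce."""
    return allreduce_s / total_s


for model, rows in GPU_TIMINGS.items():
    for workers, total_s, allreduce_s in rows:
        share = allreduce_share(total_s, allreduce_s)
        print(f"{model:12s} workers={workers} allreduce share={share:.1%}")
```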
