AllReduce benchmark
Batch size: 64. Number of batches per task: 50. Dataset: CIFAR-10, image size (32, 32, 3).
Worker resource: cpu=0.3,memory=2048Mi,ephemeral-storage=1024Mi
ResNet50 is a computation-intensive model; for CIFAR-10 it has 23,555,082 trainable parameters.
Workers | Computation : communication | Speed | Speed-up ratio |
---|---|---|---|
1 | - | 3.1 images/s | 1 |
2 | 10 : 1 | 5.65 images/s | 1.82 |
1 worker (local):
Time profiling of one task on a worker:
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds
"report_gradient time" is the time to average gradients by AllReduce.
Speed: 64 * 50 / 1033 = 3.1 images/s
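The timing lines in the profile above follow a fixed pattern, so the per-section times can be extracted mechanically. The sketch below is one way to do that. `TIMING_RE` and `parse_timings` are hypothetical helpers written for this page, not part of `timing_utils.py`, and the regular expression assumes exactly the line shape shown here.

```python
import re

# Matches the "<name> time is <seconds> seconds" tail of each DEBUG line.
# Assumption: every timing line has the shape shown in the profile above.
TIMING_RE = re.compile(r"\]\s+(\w+) time is ([0-9.eE+-]+) seconds")


def parse_timings(log_text):
    """Return {section_name: seconds} for every timing line in log_text."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


# The four lines of the 1-worker ResNet50 profile, copied verbatim.
LOG = """\
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1033.2 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1032.73 seconds
[2020-07-22 09:35:40,175] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000842333 seconds
[2020-07-22 09:35:40,176] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000983 seconds
"""

timings = parse_timings(LOG)
for name, seconds in timings.items():
    print(f"{name}: {seconds} s")
```

With a single local worker the `report_gradient` entry is close to zero, which matches the profile above.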
2 workers:
Time profiling of one task on a worker:
[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 1131.8 seconds
[2020-07-22 05:31:33,884] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 1127.86 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000961542 seconds
[2020-07-22 05:31:33,885] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 101.069 seconds
computation : communication = 10 : 1
Speed: 2 * 64 * 50 / 1131.8 = 5.65 images/s
Speed-up ratio: 5.65 / 3.1 = 1.82
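The speed and speed-up arithmetic used on this page can be written as two small functions. This is a minimal sketch with hypothetical names (`throughput`, `speedup`). It assumes, as the formulas above do, that every worker processes one task of the same size in parallel and that the `task_process` time of one worker is representative.

```python
def throughput(workers, batch_size, batches_per_task, task_seconds):
    """Images per second across all workers for one task per worker."""
    return workers * batch_size * batches_per_task / task_seconds


def speedup(speed, baseline_speed):
    """Speed relative to the single-worker baseline."""
    return speed / baseline_speed


# ResNet50 numbers from the task_process lines above.
one_worker = throughput(1, 64, 50, 1033.2)
two_workers = throughput(2, 64, 50, 1131.8)

print(f"1 worker:  {one_worker:.2f} images/s")
print(f"2 workers: {two_workers:.2f} images/s")
print(f"speed-up:  {speedup(two_workers, one_worker):.2f}")
```

The speed-up printed here is computed from unrounded speeds, so it can differ in the last digit from the value obtained by dividing the rounded speeds.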
MobileNetV2 is a communication-intensive model with 2,236,682 trainable parameters.
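To put the two parameter counts in perspective, the sketch below estimates how much gradient data one AllReduce round would carry. It rests on two assumptions that this page does not state: gradients are exchanged as float32 values, and there is exactly one gradient value per trainable parameter. `gradient_payload_mib` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: gradients are exchanged as float32


def gradient_payload_mib(trainable_params, bytes_per_value=BYTES_PER_FLOAT32):
    """Approximate gradient payload in MiB, one value per trainable parameter."""
    return trainable_params * bytes_per_value / 2**20


# Parameter counts quoted on this page.
for name, params in [("ResNet50", 23_555_082), ("MobileNetV2", 2_236_682)]:
    print(f"{name}: {gradient_payload_mib(params):.1f} MiB per gradient exchange")
```

Under these assumptions ResNet50 moves roughly ten times more gradient data than MobileNetV2, yet it is the model labelled computation-intensive here. The label depends on communication time relative to compute time, not on payload size alone.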
Workers | Computation : communication | Speed | Speed-up ratio |
---|---|---|---|
1 | - | 29 images/s | 1 |
2 | 10 : 3 | 44.5 images/s | 1.53 |
3 | 10 : 6 | 57.2 images/s | 1.97 |
1 worker (local):
Time profiling of one task on a worker:
[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 110.909 seconds
[2020-07-22 09:02:40,983] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 110.655 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.00154781 seconds
[2020-07-22 09:02:40,984] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 0.000884056 seconds
Speed: 64 * 50 / 110.9 = 29 images/s
2 workers:
Time profiling of one task on a worker:
[2020-07-22 07:11:15,382] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 143.789 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 143.148 seconds
[2020-07-22 07:11:15,383] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000886917 seconds
[2020-07-22 07:11:15,384] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 33.6484 seconds
computation : communication = 10 : 3
Speed: 2 * 64 * 50 / 143.8 = 44.5 images/s
Speed-up ratio: 44.5 / 29 = 1.53
3 workers:
Time profiling of one task on a worker:
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] task_process time is 167.943 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] batch_process time is 167.395 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] get_model time is 0.000771523 seconds
[2020-07-22 11:34:13,816] [DEBUG] [timing_utils.py:54:report_timing] report_gradient time is 60.0891 seconds
computation : communication = 10 : 6
Speed: 3 * 64 * 50 / 167.9 = 57.2 images/s
Speed-up ratio: 57.2 / 29 = 1.97
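The computation : communication ratios quoted above can be approximated from the profile lines. The sketch below assumes that `report_gradient` time is included in `batch_process` time, so that compute time is the difference between the two. That reading is an inference from the numbers, not something the logs state. `comm_per_10_compute` is a hypothetical helper.

```python
def comm_per_10_compute(batch_process_s, report_gradient_s):
    """Communication time per 10 units of compute time.

    Assumption: batch_process time includes report_gradient time,
    so compute time is the difference between the two.
    """
    compute_s = batch_process_s - report_gradient_s
    return 10 * report_gradient_s / compute_s


# (batch_process, report_gradient) in seconds, copied from the logs above.
runs = {
    "ResNet50, 2 workers": (1127.86, 101.069),
    "MobileNetV2, 2 workers": (143.148, 33.6484),
    "MobileNetV2, 3 workers": (167.395, 60.0891),
}

for name, (batch_s, grad_s) in runs.items():
    print(f"{name}: 10 : {comm_per_10_compute(batch_s, grad_s):.1f}")
```

Under that assumption the three runs come out near 10 : 1, 10 : 3 and 10 : 6, in line with the rounded ratios reported above.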
Worker resource: cpu=4,memory=8192Mi,ephemeral-storage=1024Mi
MobileNetV2
Workers | Communication share of time | Speed | Speed-up ratio |
---|---|---|---|
1 | - | 353.6 images/s | 1 |
2 | 24% | 503 images/s | 1.42 |
4 | 44.7% | 680 images/s | 1.92 |
8 | 66.7% | 648 images/s | 1.83 |
ResNet50
Workers | Communication share of time | Speed | Speed-up ratio |
---|---|---|---|
1 | - | 26.7 images/s | 1 |
2 | 18% | 41 images/s | 1.57 |
4 | 25% | 68.4 images/s | 2.56 |
8 | 32% | 123 images/s | 4.61 |
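One way to compare the two tables above is parallel efficiency, the speed-up ratio divided by the number of workers (1.0 would be perfect linear scaling). The sketch below uses the ratio columns as reported. `efficiency` is a hypothetical helper and the metric is an addition for illustration, not something measured in the benchmark.

```python
def efficiency(speedup_ratio, workers):
    """Parallel efficiency: 1.0 means perfect linear scaling."""
    return speedup_ratio / workers


# Speed-up ratios copied from the cpu=4 tables above, keyed by worker count.
reported_ratios = {
    "MobileNetV2": {2: 1.42, 4: 1.92, 8: 1.83},
    "ResNet50": {2: 1.57, 4: 2.56, 8: 4.61},
}

for model, ratios in reported_ratios.items():
    for workers, ratio in ratios.items():
        print(f"{model}, {workers} workers: efficiency {efficiency(ratio, workers):.2f}")
```

For MobileNetV2 the efficiency falls to about 0.23 at 8 workers, where total speed is lower than at 4 workers, while ResNet50 stays near 0.58 at 8 workers.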