Commit 280a1f9

Merge pull request ilkarman#67 from ilkarman/ikdev2
Multi-GPU in-RAM
2 parents 9b60b88 + f777e24 commit 280a1f9

File tree

5 files changed: +435 −135 lines


README.md (+12 −6)

@@ -50,14 +50,20 @@ Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.ed
 
 **This is a work in progress**
 
-| DL Library | 1xP100/CUDA 9/CuDNN 7 | 2xP100/CUDA 9/CuDNN 7 | 4xP100/CUDA 9/CuDNN 7 |
-| ----------------------------------------------- | :------------------: | :-------------------: | :------------------: |
-| [Pytorch](notebooks/PyTorch_MultiGPU.ipynb) | 41min46s | 28min50s | 23min31s |
-| [Keras(TF)](notebooks/Keras_TF_MultiGPU.ipynb) | 51min27s | 32min1s | 23min3s |
-| [Tensorflow](notebooks/Tensorflow_MultiGPU.ipynb) | 62min8s | 44min13s | 33min |
+**CUDA 9/CuDNN 7.0**
 
+| DL Library | 1xP100 | 2xP100 | 4xP100 | **4xP100 Synthetic Data** |
+| ----------------------------------------------- | :------------------: | :-------------------: | :------------------: | :------------------: |
+| [Pytorch](notebooks/PyTorch_MultiGPU.ipynb) | 41min46s | 28min50s | 23min7s | 11min48s |
+| [Keras(TF)](notebooks/Keras_TF_MultiGPU.ipynb) | 51min27s | 32min1s | 22min49s | 18min30s |
+| [Tensorflow](notebooks/Tensorflow_MultiGPU.ipynb) | 62min8s | 44min13s | 31min4s | 17min10s |
+| [Chainer]() | ? | ? | ? | ? |
+| [MXNet]() | ? | ? | ? | ? |
 
-Input for this model is 112,120 PNGs of chest X-rays. **Note for the notebook to automatically download the data you must install [Azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy) and increase the size of your OS-Disk in Azure Portal so that you have at-least 45GB of free-space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to first download the data.** These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform data-augmentation. We want to rewrite the data-loaders to use OpenCV instead of PIL to reduce IO-bottlenecking.
+
+Input for this model is 112,120 PNGs of chest X-rays. **Note for the notebook to automatically download the data you must install [Azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy) and increase the size of your OS-Disk in Azure Portal so that you have at-least 45GB of free-space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to first download the data.** These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform data-augmentation.
+
+Comparing synthetic data to actual PNG files we can estimate the IO lag for **PyTorch (~11min), Keras(TF) (~4min), Tensorflow (~13min)!** We need to investigate this to establish the most performant data-loading pipeline and any **help is appreciated**. The current plan is to write functions in OpenCV (or perhaps use ChainerCV) and share between all frameworks.
 
 ### 3. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction
 
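The synthetic-data column above is measured by training on random in-RAM arrays shaped like the real inputs, so the gap between the two runs approximates the IO-plus-decoding cost. A minimal NumPy sketch of that setup (`BATCHSIZE`, `CLASSES` and `N_SAMPLES` are illustrative stand-ins for the notebook's values, which come from the training config and `train_dataset.n`):

```python
import numpy as np

# Illustrative sizes -- the notebooks derive these from the real dataset.
BATCHSIZE = 64
CLASSES = 14        # the chest X-ray task has 14 pathology labels
N_SAMPLES = 1000    # stand-in for train_dataset.n

# Drop the remainder so every step processes a full batch.
batches_per_epoch = N_SAMPLES // BATCHSIZE
tot_num = batches_per_epoch * BATCHSIZE

# Random tensors with the real input shape/dtype (CHW, 224x224, float32):
# feeding these to the training loop removes disk IO and PNG decoding entirely.
fake_X = np.random.rand(tot_num, 3, 224, 224).astype(np.float32)
fake_y = np.random.rand(tot_num, CLASSES).astype(np.float32)
```

Subtracting the synthetic-data time from the real-data time gives the quoted IO-lag estimates, e.g. 22min49s − 18min30s ≈ 4min for Keras(TF).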

notebooks/Keras_TF_MultiGPU.ipynb (+87 −20)

@@ -164,8 +164,8 @@
     "Please make sure to download\n",
     "https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy\n",
     "Data already exists\n",
-    "CPU times: user 708 ms, sys: 228 ms, total: 936 ms\n",
-    "Wall time: 936 ms\n"
+    "CPU times: user 711 ms, sys: 209 ms, total: 920 ms\n",
+    "Wall time: 919 ms\n"
    ]
   }
  ],
@@ -364,8 +364,8 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "CPU times: user 1min 26s, sys: 6.1 s, total: 1min 33s\n",
-    "Wall time: 1min 30s\n"
+    "CPU times: user 1min 22s, sys: 5.25 s, total: 1min 27s\n",
+    "Wall time: 1min 25s\n"
    ]
   }
  ],
@@ -390,8 +390,8 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "CPU times: user 35.7 ms, sys: 3.59 ms, total: 39.3 ms\n",
-    "Wall time: 37.2 ms\n"
+    "CPU times: user 32.6 ms, sys: 3.71 ms, total: 36.4 ms\n",
+    "Wall time: 35.1 ms\n"
    ]
   }
  ],
@@ -411,24 +411,23 @@
    "output_type": "stream",
    "text": [
     "Epoch 1/5\n",
-    "342/342 [==============================] - 342s 999ms/step - loss: 0.1807 - val_loss: 0.1685\n",
+    "342/342 [==============================] - 334s 977ms/step - loss: 0.1810 - val_loss: 0.1636\n",
     "Epoch 2/5\n",
-    "342/342 [==============================] - 254s 742ms/step - loss: 0.1522 - val_loss: 0.1488\n",
+    "342/342 [==============================] - 249s 729ms/step - loss: 0.1514 - val_loss: 0.1432\n",
     "Epoch 3/5\n",
-    "342/342 [==============================] - 251s 733ms/step - loss: 0.1485 - val_loss: 0.1463\n",
+    "342/342 [==============================] - 250s 731ms/step - loss: 0.1481 - val_loss: 0.1457\n",
     "Epoch 4/5\n",
-    "342/342 [==============================] - ETA: 0s - loss: 0.145 - 245s 717ms/step - loss: 0.1458 - val_loss: 0.1481\n",
+    "342/342 [==============================] - 251s 734ms/step - loss: 0.1458 - val_loss: 0.1438\n",
     "Epoch 5/5\n",
-    "341/342 [============================>.] - ETA: 0s - loss: 0.1446Epoch 5/5\n",
-    "342/342 [==============================] - 252s 738ms/step - loss: 0.1447 - val_loss: 0.1387\n",
-    "CPU times: user 1h 6min 48s, sys: 23min 26s, total: 1h 30min 14s\n",
-    "Wall time: 23min 3s\n"
+    "342/342 [==============================] - 247s 721ms/step - loss: 0.1440 - val_loss: 0.1418\n",
+    "CPU times: user 1h 7min 8s, sys: 23min 4s, total: 1h 30min 12s\n",
+    "Wall time: 22min 49s\n"
    ]
   },
   {
    "data": {
     "text/plain": [
-     "<keras.callbacks.History at 0x7f319d282860>"
+     "<keras.callbacks.History at 0x7fdf6e7143c8>"
     ]
    },
    "execution_count": 20,
@@ -440,7 +439,7 @@
   "%%time\n",
   "# 1 GPU - Main training loop: 51min 27s\n",
   "# 2 GPU - Main training loop: 32min 1s\n",
-  "# 4 GPU - Main training loop: 23min 3s\n",
+  "# 4 GPU - Main training loop: 22min 49s\n",
   "model.fit_generator(train_dataset,\n",
   " epochs=EPOCHS,\n",
   " verbose=1,\n",
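The timing comments in that cell make the multi-GPU scaling easy to quantify; a quick back-of-envelope check (times copied from the comments above):

```python
# Wall-clock Keras(TF) training times from the cell's comments, in seconds.
times = {1: 51 * 60 + 27,   # 51min 27s
         2: 32 * 60 + 1,    # 32min 1s
         4: 22 * 60 + 49}   # 22min 49s

for gpus, t in sorted(times.items()):
    speedup = times[1] / t          # relative to the single-GPU run
    efficiency = speedup / gpus     # 1.0 would be perfect linear scaling
    print(f"{gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

The sub-linear numbers (roughly 1.6x on 2 GPUs, 2.3x on 4) are consistent with the IO bottleneck discussed in the README, since the data pipeline does not speed up with extra GPUs.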
@@ -481,8 +480,8 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "CPU times: user 5min 40s, sys: 1min 48s, total: 7min 29s\n",
-    "Wall time: 2min 16s\n"
+    "CPU times: user 5min 35s, sys: 1min 44s, total: 7min 20s\n",
+    "Wall time: 2min 14s\n"
    ]
   }
  ],
@@ -502,14 +501,82 @@
    "name": "stdout",
    "output_type": "stream",
    "text": [
-    "Full AUC [0.8166704403329452, 0.8701978640353484, 0.8036644715384587, 0.8991700123597787, 0.8900824513691525, 0.9197848609229234, 0.7292038166231667, 0.8975747639269652, 0.6324781069481422, 0.8465972198647057, 0.7451801565774874, 0.8049089113120023, 0.7560819980737239, 0.8914456631015975]\n",
-    "Validation AUC: 0.8216\n"
+    "Full AUC [0.810400224263596, 0.8642047989855159, 0.801330086449206, 0.9072074321344181, 0.8906798540400607, 0.9213575843667169, 0.7088805005859234, 0.9128299199053916, 0.6267736564423316, 0.8542487673046052, 0.7531549949370517, 0.803228785418665, 0.7709379338811964, 0.8884575500057307]\n",
+    "Validation AUC: 0.8224\n"
    ]
   }
  ],
  "source": [
   "print(\"Validation AUC: {0:.4f}\".format(compute_roc_auc(test_dataset.classes, y_guess, CLASSES)))"
  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": 25,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "#####################################################################################################\n",
+   "## Synthetic Data (Pure Training)"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": 26,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "# Test on fake-data -> no IO lag\n",
+   "batch_in_epoch = train_dataset.n//BATCHSIZE\n",
+   "tot_num = batch_in_epoch * BATCHSIZE\n",
+   "fake_X = np.random.rand(tot_num, 3, 224, 224).astype(np.float32)\n",
+   "fake_y = np.random.rand(tot_num, CLASSES).astype(np.float32)"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": 29,
+  "metadata": {},
+  "outputs": [
+   {
+    "name": "stdout",
+    "output_type": "stream",
+    "text": [
+     "Epoch 1/5\n",
+     "87296/87296 [==============================] - 224s 3ms/step - loss: 0.6933\n",
+     "Epoch 2/5\n",
+     "87296/87296 [==============================] - 222s 3ms/step - loss: 0.6932\n",
+     "Epoch 3/5\n",
+     "87296/87296 [==============================] - 222s 3ms/step - loss: 0.6930\n",
+     "Epoch 4/5\n",
+     "87296/87296 [==============================] - 222s 3ms/step - loss: 0.6924\n",
+     "Epoch 5/5\n",
+     "87296/87296 [==============================] - 221s 3ms/step - loss: 0.6911\n",
+     "CPU times: user 1h 5min 19s, sys: 16min 44s, total: 1h 22min 3s\n",
+     "Wall time: 18min 30s\n"
+    ]
+   },
+   {
+    "data": {
+     "text/plain": [
+      "<keras.callbacks.History at 0x7fdda382a5c0>"
+     ]
+    },
+    "execution_count": 29,
+    "metadata": {},
+    "output_type": "execute_result"
+   }
+  ],
+  "source": [
+   "%%time\n",
+   "# 4 GPU - Main training loop: 22min 49s\n",
+   "# 4 GPU - Synthetic data: 18min 30s\n",
+   "model.fit(fake_X,\n",
+   " fake_y,\n",
+   " batch_size=BATCHSIZE,\n",
+   " epochs=EPOCHS,\n",
+   " verbose=1)"
+  ]
  }
 ],
 "metadata": {
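As a side note, the "Validation AUC: 0.8224" line in the diff is the unweighted mean of the 14 per-class AUCs printed just before it; presumably `compute_roc_auc` scores each pathology column separately (e.g. with scikit-learn's `roc_auc_score`) and averages. Reproducing the reduction from the printed values:

```python
# Per-class validation AUCs copied from the notebook output above
# (one per chest X-ray pathology label).
full_auc = [0.810400224263596, 0.8642047989855159, 0.801330086449206,
            0.9072074321344181, 0.8906798540400607, 0.9213575843667169,
            0.7088805005859234, 0.9128299199053916, 0.6267736564423316,
            0.8542487673046052, 0.7531549949370517, 0.803228785418665,
            0.7709379338811964, 0.8884575500057307]

mean_auc = sum(full_auc) / len(full_auc)
print("Validation AUC: {0:.4f}".format(mean_auc))  # matches the notebook output
```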
