Commit 57841da

Merge pull request ilkarman#61 from ilkarman/multi_gpu
multi gpu examples
2 parents e93d0d0 + e38f9dc commit 57841da

5 files changed

Lines changed: 1965 additions & 9 deletions

README.md

Lines changed: 23 additions & 8 deletions
@@ -1,16 +1,18 @@
 # Deep Learning Framework Examples

-![logo](support/logo.png)
+<p align="center">
+<img src="support/logo.png" alt="logo" width="50%"/>
+</p>

-*Note: We are finalising our multi-GPU (single-node) examples (DenseNet-121 + Data-augmentation + Logging + Data-loaders), which you can follow [here](https://github.com/ilkarman/DeepLearningFrameworks/tree/multi_gpu#2-training-time-densenet-121-on-chestxray---image-recognition-multi-gpu)*
+*Note: We have recently added multi-GPU (single-node) examples on fine-tuning DenseNet-121 on Chest X-rays aka [CheXnet](https://stanfordmlgroup.github.io/projects/chexnet/). This is still work-in-progress and contributions are highly welcome!*

 ## Goal

 1. Create a Rosetta Stone of deep-learning frameworks to allow data-scientists to easily leverage their expertise from one framework to another
-2. Optimised GPU code with minimal verbosity (simple examples)
+2. Optimised GPU code with using the most up-to-date highest-level APIs.
 3. Common setup for comparisons across GPUs (potentially CUDA versions and precision)
 4. Common setup for comparisons across languages (Python, Julia, R)
-5. Possibility to verify own installation
+5. Possibility to verify expected performance of own installation
 4. Collaboration between different open-source communities

 The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning) using both the K80 and the newer P100.
@@ -37,11 +39,25 @@ The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://a
 | [Julia - Knet](notebooks/Knet_CNN.ipynb) | 159 | ?? |
 | [R - MXNet](notebooks/.ipynb) | ??? | ?? |

+
 *Note: It is recommended to use higher level APIs where possible; see these notebooks for examples with [Tensorflow](support/Tensorflow_CNN_highAPI.ipynb), [MXNet](support/MXNet_CNN_highAPI.ipynb) and [CNTK](support/CNTK_CNN_highAPI.ipynb). They are not linked in the table to keep the common-structure-for-all approach*

 Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.edu/~kriz/cifar.html) containing 50k training images and 10k test images, uniformly split across 10 classes. Each 32 by 32 image is supplied as a tensor of shape (3, 32, 32) with pixel intensity re-scaled from 0-255 to 0-1.

-### 2. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction
+### 2. Training Time: DenseNet-121 on ChestXRay - Image Recognition (Multi-GPU)
+
+**This is a work in progress**
+
+| DL Library | 1xP100/CUDA 9/CuDNN 7 | 2xP100/CUDA 9/CuDNN 7 | 4xP100/CUDA 9/CuDNN 7 |
+| ----------------------------------------------- | :------------------: | :-------------------: | :------------------: |
+| [Pytorch](notebooks/PyTorch_MultiGPU.ipynb) | 41min46s | 28min50s | 23min31s |
+| [Keras(TF)](notebooks/Keras_TF_MultiGPU.ipynb) | 51min27s | 32min1s | 23min3s |
+| [Tensorflow](notebooks/Tensorflow_MultiGPU.ipynb) | 62min8s | 44min13s | 33min |
+
+
+Input for this model is 112,120 PNGs of chest X-rays. **Note for the notebook to automatically download the data you must install [Azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy) and increase the size of your OS-Disk in Azure Portal so that you have at-least 45GB of free-space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to first download the data.** These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform data-augmentation. We want to rewrite the data-loaders to use OpenCV instead of PIL to reduce IO-bottlenecking.
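
Editorial sketch (not part of the commit): the scaling implied by the multi-GPU training times in the table added by this hunk can be worked out with a few lines of Python. The timings are copied from that table; the helper names `to_seconds` and `scaling` are hypothetical and exist only for this illustration. Efficiency here is simply speedup divided by the number of GPUs, which is one common convention and an assumption on our part.

```python
import re

# Training times copied from the multi-GPU table (1, 2 and 4 P100s).
TIMES = {
    "Pytorch": ["41min46s", "28min50s", "23min31s"],
    "Keras(TF)": ["51min27s", "32min1s", "23min3s"],
    "Tensorflow": ["62min8s", "44min13s", "33min"],
}
GPU_COUNTS = [1, 2, 4]


def to_seconds(duration: str) -> int:
    """Convert a string such as '41min46s' or '33min' to seconds."""
    match = re.fullmatch(r"(\d+)min(?:(\d+)s)?", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds or 0)


def scaling(durations):
    """Return (speedup, efficiency) pairs relative to the single-GPU run."""
    base = to_seconds(durations[0])
    result = []
    for gpus, duration in zip(GPU_COUNTS, durations):
        speedup = base / to_seconds(duration)
        # Efficiency: fraction of ideal linear scaling that was achieved.
        result.append((speedup, speedup / gpus))
    return result


if __name__ == "__main__":
    for library, durations in TIMES.items():
        for gpus, (speedup, efficiency) in zip(GPU_COUNTS, scaling(durations)):
            print(f"{library:<11} {gpus} GPU(s): "
                  f"{speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Sub-linear scaling in such a calculation would be consistent with the IO-bottlenecking that the paragraph above mentions, although the table alone does not establish the cause.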
+
+### 3. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction

 | DL Library | K80/CUDA 8/CuDNN 6 | P100/CUDA 8/CuDNN 6 |
 | --------------------------------------------------- | :----------------: | :-----------------: |
@@ -57,10 +73,9 @@ Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.ed
 | [R - MXNet](notebooks/.ipynb) | ??? | ??? |


-
 A pre-trained ResNet50 model is loaded and chopped just after the avg_pooling at the end (7, 7), which outputs a 2048D dimensional vector. This can be plugged into a softmax layer or another classifier such as a boosted tree to perform transfer learning. Allowing for a warm start; this forward-only pass to the avg_pool layer is timed. *Note: batch-size remains constant, however filling the RAM on a GPU would produce further performance boosts (greater for GPUs with more RAM).*
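
Editorial sketch (not part of the commit): the "warm start" timing described in the paragraph above can be illustrated in plain Python. This is a minimal sketch, not the notebooks' code: `time_forward_pass` and `fake_extractor` are hypothetical names, and the stand-in extractor only mimics the 2048-dimensional output shape. With a real GPU framework that launches work asynchronously, the results would also need to be synchronised or copied to the host before the clock is stopped.

```python
import time


def time_forward_pass(extract_features, batches, warmup_batches=1):
    """Time extract_features over batches, excluding the warm-up batches.

    Returns (elapsed_seconds, number_of_items_timed).
    """
    batches = list(batches)
    for batch in batches[:warmup_batches]:
        extract_features(batch)  # warm start: run once, do not time it

    n_items = 0
    start = time.perf_counter()
    for batch in batches[warmup_batches:]:
        features = extract_features(batch)
        n_items += len(features)
    elapsed = time.perf_counter() - start
    return elapsed, n_items


def fake_extractor(batch):
    """Hypothetical stand-in for a model chopped after avg_pool (2048-D output)."""
    return [[0.0] * 2048 for _ in batch]


if __name__ == "__main__":
    # Three batches of four dummy "images"; the first batch is the warm-up.
    dummy_batches = [[None] * 4 for _ in range(3)]
    elapsed, n_items = time_forward_pass(fake_extractor, dummy_batches)
    print(f"{n_items} items in {elapsed:.6f}s")
```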

-### 3. Training Time(s): RNN (GRU) on IMDB - Sentiment Analysis
+### 4. Training Time(s): RNN (GRU) on IMDB - Sentiment Analysis

 | DL Library | K80/CUDA 8/CuDNN 6 | P100/CUDA 8/CuDNN 6 | Using CuDNN? |
 | ---------------------------------------- | :----------------: | :----------------: | :----------: |
@@ -142,4 +157,4 @@ The below offers some insights I gained after trying to match test-accuracy acro

 1. There are multiple RNN implementations/kernels available for most frameworks (for example [Tensorflow](http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html)); once reduced down to the cudnnLSTM/GRU level the execution is the fastest, however this implementation is less flexible (e.g. maybe you want layer normalisation) and may become problematic if inference is run on the CPU at a later stage. At the cudDNN level most of the frameworks' runtimes are very similar. [This](https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/) Nvidia blog-post goes through several interesting cuDNN optimisations for recurrent neural nets e.g. fusing - "combining the computation of many small matrices into that of larger ones and streaming the computation whenever possible, the ratio of computation to memory I/O can be increased, which results in better performance on GPU".

-2. It seems that the fastest data-shape for RNNs is TNC - implementing this in [MXNet](support/MXNet_RNN_TNC.ipynb) only gave an improvement of 0.5s so I have chosen to use the sligthly slower shape to remain consistent with other frameworks and to keep the code less complicated
+2. It seems that the fastest data-shape for RNNs is TNC - implementing this in [MXNet](support/MXNet_RNN_TNC.ipynb) only gave an improvement of 0.5s so I have chosen to use the sligthly slower shape to remain consistent with other frameworks and to keep the code less complicated
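
Editorial sketch (not part of the commit): the TNC data-shape mentioned in the last item puts the time axis first, i.e. (sequence length T, batch size N, features C), whereas a batch-first layout is (N, T, C). The snippet below only illustrates the re-ordering on nested Python lists; `ntc_to_tnc` is a hypothetical helper, and a real notebook would transpose a framework tensor instead.

```python
def ntc_to_tnc(batch):
    """Transpose nested lists from (N, T, C) layout to (T, N, C) layout.

    batch[n][t] is the feature vector of sequence n at time step t;
    the result satisfies result[t][n] == batch[n][t].
    All sequences are assumed to have the same length T.
    """
    return [list(step) for step in zip(*batch)]


if __name__ == "__main__":
    # N=2 sequences, T=3 time steps, C=1 feature per step.
    batch_first = [
        [[1], [2], [3]],
        [[4], [5], [6]],
    ]
    time_first = ntc_to_tnc(batch_first)
    print(len(time_first), len(time_first[0]), len(time_first[0][0]))
```

In the time-first layout, all N samples for a given time step sit next to each other, which is the grouping a recurrent layer consumes at each step.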
