wenouyang
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,16 +1,18 @@
 # Deep Learning Framework Examples
 
-![logo](support/logo.png)
+<p align="center">
+<img src="support/logo.png" alt="logo" width="50%"/>
+</p>
 
-*Note: We are finalising our multi-GPU (single-node) examples (DenseNet-121 + Data-augmentation + Logging + Data-loaders), which you can follow [here](https://github.com/ilkarman/DeepLearningFrameworks/tree/multi_gpu#2-training-time-densenet-121-on-chestxray---image-recognition-multi-gpu)*
+*Note: We have recently added multi-GPU (single-node) examples of fine-tuning DenseNet-121 on chest X-rays, also known as [CheXNet](https://stanfordmlgroup.github.io/projects/chexnet/). This is still a work in progress and contributions are highly welcome!*
 
 ## Goal
 
 1. Create a Rosetta Stone of deep-learning frameworks to allow data-scientists to easily leverage their expertise from one framework to another
-2. Optimised GPU code with minimal verbosity (simple examples)
+2. Optimised GPU code using the most up-to-date, highest-level APIs
 3. Common setup for comparisons across GPUs (potentially CUDA versions and precision)
 4. Common setup for comparisons across languages (Python, Julia, R)
-5. Possibility to verify own installation
+5. Possibility to verify the expected performance of your own installation
 6. Collaboration between different open-source communities
 
 The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning) using both the K80 and the newer P100. 
@@ -37,11 +39,25 @@ The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://a
 | [Julia - Knet](notebooks/Knet_CNN.ipynb)              |        159         |         ??          |
 | [R - MXNet](notebooks/.ipynb)                         |        ???         |         ??          |
 
+
 *Note: It is recommended to use higher level APIs where possible; see these notebooks for examples with [Tensorflow](support/Tensorflow_CNN_highAPI.ipynb), [MXNet](support/MXNet_CNN_highAPI.ipynb) and [CNTK](support/CNTK_CNN_highAPI.ipynb). They are not linked in the table to keep the common-structure-for-all approach*
 
 Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.edu/~kriz/cifar.html) containing 50k training images and 10k test images, uniformly split across 10 classes. Each 32 by 32 image is supplied as a tensor of shape (3, 32, 32) with pixel intensity re-scaled from 0-255 to 0-1. 
 
-### 2. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction
+### 2. Training Time: DenseNet-121 on ChestXRay - Image Recognition (Multi-GPU)
+
+**This is a work in progress**
+
+| DL Library                                        | 1xP100/CUDA 9/CuDNN 7 | 2xP100/CUDA 9/CuDNN 7 | 4xP100/CUDA 9/CuDNN 7 |
+| ------------------------------------------------- | :-------------------: | :-------------------: | :-------------------: |
+| [Pytorch](notebooks/PyTorch_MultiGPU.ipynb)       | 41min46s              | 28min50s              | 23min31s              |
+| [Keras(TF)](notebooks/Keras_TF_MultiGPU.ipynb)    | 51min27s              | 32min1s               | 23min3s               |
+| [Tensorflow](notebooks/Tensorflow_MultiGPU.ipynb) | 62min8s               | 44min13s              | 33min                 |
+
+
+Input for this model is 112,120 PNGs of chest X-rays. **Note: for the notebook to download the data automatically, you must install [Azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy) and increase the size of your OS disk in the Azure Portal so that you have at least 45GB of free space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to download the data the first time they run.** These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform data-augmentation. We want to rewrite the data-loaders to use OpenCV instead of PIL to reduce the IO bottleneck.
+
+### 3. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction
 
 | DL Library                                          | K80/CUDA 8/CuDNN 6 | P100/CUDA 8/CuDNN 6 |
 | --------------------------------------------------- | :----------------: | :-----------------: |
@@ -57,10 +73,9 @@ Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.ed
 | [R - MXNet](notebooks/.ipynb)                       | ???                | ???                 |
 
 
-
 A pre-trained ResNet50 model is loaded and chopped just after the avg_pooling at the end (7, 7), which outputs a 2048-dimensional vector. This can be plugged into a softmax layer or another classifier such as a boosted tree to perform transfer learning. After allowing for a warm start, this forward-only pass to the avg_pool layer is timed. *Note: batch-size remains constant, however filling the RAM on a GPU would produce further performance boosts (greater for GPUs with more RAM).*
 
-### 3. Training Time(s): RNN (GRU) on IMDB - Sentiment Analysis
+### 4. Training Time(s): RNN (GRU) on IMDB - Sentiment Analysis
 
 | DL Library                               | K80/CUDA 8/CuDNN 6 | P100/CUDA 8/CuDNN 6 | Using CuDNN? |
 | ---------------------------------------- | :----------------: | :----------------:  | :----------: |
@@ -142,4 +157,4 @@ The below offers some insights I gained after trying to match test-accuracy acro
 
 1. There are multiple RNN implementations/kernels available for most frameworks (for example [Tensorflow](http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html)); once reduced down to the cudnnLSTM/GRU level the execution is the fastest, however this implementation is less flexible (e.g. maybe you want layer normalisation) and may become problematic if inference is run on the CPU at a later stage. At the cuDNN level most of the frameworks' runtimes are very similar. [This](https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/) Nvidia blog-post goes through several interesting cuDNN optimisations for recurrent neural nets e.g. fusing - "combining the computation of many small matrices into that of larger ones and streaming the computation whenever possible, the ratio of computation to memory I/O can be increased, which results in better performance on GPU".
 
-2. It seems that the fastest data-shape for RNNs is TNC - implementing this in [MXNet](support/MXNet_RNN_TNC.ipynb) only gave an improvement of 0.5s so I have chosen to use the sligthly slower shape to remain consistent with other frameworks and to keep the code less complicated
+2. It seems that the fastest data-shape for RNNs is TNC. Implementing this in [MXNet](support/MXNet_RNN_TNC.ipynb) only gave an improvement of 0.5s, so I have chosen to use the slightly slower shape to remain consistent with other frameworks and to keep the code less complicated
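
For readers unfamiliar with the TNC layout mentioned in the last item: T is the sequence length (time), N the batch size and C the number of features per step. The sketch below is only an illustration and is not taken from the linked notebook. It uses plain nested lists in place of framework tensors, and the helper name `batch_major_to_time_major` is hypothetical. In a real framework this would typically be a single axis swap on the tensor.

```python
def batch_major_to_time_major(batch):
    """Convert nested lists shaped (N, T, C) into (T, N, C).

    Hypothetical helper for illustration; assumes every sequence
    in the batch has the same number of time steps.
    """
    # zip(*batch) groups the t-th step of every sequence together.
    return [list(step) for step in zip(*batch)]


# Two sequences (N=2), three time steps (T=3), two features (C=2).
ntc = [
    [[0, 1], [2, 3], [4, 5]],
    [[6, 7], [8, 9], [10, 11]],
]

tnc = batch_major_to_time_major(ntc)

# The first time step now holds step 0 of both sequences.
print(tnc[0])  # [[0, 1], [6, 7]]
```

With time as the leading axis, all the inputs for one step sit together, which is the layout the TNC notebook feeds to the RNN.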