wenouyang
diff --git a/.gitignore b/.gitignore (+2 -1)
diff --git a/README.md b/README.md (+25 -10)
diff --git a/support/CNTK_CNN_highAPI.ipynb b/notebooks/CNTK_CNN_highAPI.ipynb (renamed)
diff --git a/notebooks/Gluon_CNN.ipynb b/notebooks/Gluon_CNN.ipynb (+51 -31)
@@ -5,4 +5,5 @@
 cifar-10-batches-py/
 __pycache__
 .DS_Store
-
+*.params
+*-symbol.json
@@ -8,6 +8,8 @@
 
 **For more details check out our [blog-post](https://blogs.technet.microsoft.com/machinelearning/2018/03/14/comparing-deep-learning-frameworks-a-rosetta-stone-approach/)**
 
+We want to extend our gratitude to the CNTK, PyTorch, Chainer, Caffe2, MXNet and Knet teams, and to everyone else from the open-source community who contributed to the repo over the past few months.
+
 ## Goal
 
 1. Create a Rosetta Stone of deep-learning frameworks to allow data-scientists to easily leverage their expertise from one framework to another
@@ -19,6 +21,8 @@
 
 The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning) using both the K80 and the newer P100. 
 
+Please treat these notebooks as examples rather than as a formal benchmark (only the multi-GPU examples have the potential to become one at some point); for further details see this [post](https://www.reddit.com/r/MachineLearning/comments/7v3ibo/discussion_stop_benchmark_stupidity_and_improve_it/dtp9hng/).
+
 *Accuracies are reported in notebooks, they should match to ensure we have common mode/code*
 
 ## Results
@@ -29,20 +33,20 @@ The notebooks are executed on an Azure [Deep Learning Virtual Machine](https://a
 | ----------------------------------------------------- | :----------------: | :-----------------: |
 | [Caffe2](notebooks/Caffe2_CNN.ipynb)                  |        148         |         54          |
 | [Chainer](notebooks/Chainer_CNN.ipynb)                |        162         |         69          |
-| [CNTK](notebooks/CNTK_CNN.ipynb)                      |        163         |         53          |
-| [Gluon](notebooks/Gluon_CNN.ipynb)                    |        152         |         62          |
+| [CNTK](notebooks/CNTK_CNN.ipynb)  ([HighAPI](notebooks/CNTK_CNN_highAPI.ipynb))                    |        163         |         53          |
+| [MXNet(Gluon)](notebooks/Gluon_CNN.ipynb)             |        152         |         57          |
 | [Keras(CNTK)](notebooks/Keras_CNTK_CNN.ipynb)         |        194         |         76          |
 | [Keras(TF)](notebooks/Keras_TF_CNN.ipynb)             |        241         |         76          |
 | [Keras(Theano)](notebooks/Keras_Theano_CNN.ipynb)     |        269         |         93          |
-| [Tensorflow](notebooks/Tensorflow_CNN.ipynb)          |        173         |         57          |
+| [Tensorflow](notebooks/Tensorflow_CNN.ipynb)  ([HighAPI](notebooks/Tensorflow_CNN_highAPI.ipynb))  |        173         |         57          |
 | [Lasagne(Theano)](notebooks/Theano_Lasagne_CNN.ipynb) |        253         |         65          |
-| [MXNet](notebooks/MXNet_CNN.ipynb)                    |        145         |         51          |
+| [MXNet(Module API)](notebooks/MXNet_CNN.ipynb)  ([HighAPI](notebooks/MXNet_CNN_highAPI.ipynb))     |        145         |         52          |
 | [PyTorch](notebooks/PyTorch_CNN.ipynb)                |        169         |         51          |
 | [Julia - Knet](notebooks/Knet_CNN.ipynb)              |        159         |         ??          |
 | [R - MXNet](notebooks/.ipynb)                         |        ???         |         ??          |
 
 
-*Note: It is recommended to use higher level APIs where possible; see these notebooks for examples with [Tensorflow](support/Tensorflow_CNN_highAPI.ipynb), [MXNet](support/MXNet_CNN_highAPI.ipynb) and [CNTK](support/CNTK_CNN_highAPI.ipynb). They are not linked in the table to keep the common-structure-for-all approach*
+*Note: It is recommended to use higher-level APIs where possible; see these notebooks for examples with [Tensorflow](notebooks/Tensorflow_CNN_highAPI.ipynb), [MXNet](notebooks/MXNet_CNN_highAPI.ipynb) and [CNTK](notebooks/CNTK_CNN_highAPI.ipynb). They are linked as HighAPI next to the corresponding entries in the table; the timings shown are for the common-structure notebooks*
 
 Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.edu/~kriz/cifar.html) containing 50k training images and 10k test images, uniformly split across 10 classes. Each 32 by 32 image is supplied as a tensor of shape (3, 32, 32) with pixel intensity re-scaled from 0-255 to 0-1. 
 
@@ -58,7 +62,7 @@ Input for this model is the standard [CIFAR-10 dataset](http://www.cs.toronto.ed
 | [Keras(TF)](notebooks/Keras_TF_MultiGPU.ipynb)    | 51min27s              | 32min1s               | 22min49s              | 18min30s              |
 | [Tensorflow](notebooks/Tensorflow_MultiGPU.ipynb) | 62min8s               | 44min13s              | 31min4s               | 17min10s              |
 | [Chainer]()                                       | ?                     | ?                     | ?                     | ?                     |
-| [MXNet]()                                         | ?                     | ?                     | ?                     | ?                     |
+| [MXNet(Module API)]()                             | ?                     | ?                     | ?                     | ?                     |
 
 
 Input for this model is 112,120 PNGs of chest X-rays. **Note for the notebook to automatically download the data you must install [Azcopy](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-linux#download-and-install-azcopy) and increase the size of your OS-Disk in Azure Portal so that you have at-least 45GB of free-space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to first download the data.** These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform data-augmentation. 
@@ -72,10 +76,11 @@ Comparing synthetic data to actual PNG files we can estimate the IO lag for **Py
 | [Caffe2](notebooks/Caffe2_Inference.ipynb)          | 14.1               | 7.9                 |
 | [Chainer](notebooks/Chainer_Inference.ipynb)        | 9.3                | 2.7                 |
 | [CNTK](notebooks/CNTK_Inference.ipynb)              | 8.5                | 1.6                 |
+| [MXNet(Gluon)](notebooks/Gluon_Inference.ipynb)     |                    | 1.7                 |
 | [Keras(CNTK)](notebooks/Keras_CNTK_Inference.ipynb) | 21.7               | 5.9                 |
 | [Keras(TF)](notebooks/Keras_TF_Inference.ipynb)     | 10.2               | 2.9                 |
 | [Tensorflow](notebooks/Tensorflow_Inference.ipynb)  | 6.5                | 1.8                 |
-| [MXNet](notebooks/MXNet_Inference.ipynb)            | 7.7                | 2.0                 |
+| [MXNet(Module API)](notebooks/MXNet_Inference.ipynb)| 7.7                | 1.6                 |
 | [PyTorch](notebooks/PyTorch_Inference.ipynb)        | 7.7                | 1.9                 |
 | [Julia - Knet](notebooks/Knet_Inference.ipynb)      | 6.3                | ???                 |
 | [R - MXNet](notebooks/.ipynb)                       | ???                | ???                 |
@@ -90,7 +95,7 @@ A pre-trained ResNet50 model is loaded and chopped just after the avg_pooling at
 | [CNTK](notebooks/CNTK_RNN.ipynb)                   | 32                 | 15                  | Yes          |
 | [Keras(CNTK)](notebooks/Keras_CNTK_RNN.ipynb)      | 86                 | 53                  | No           |
 | [Keras(TF)](notebooks/Keras_TF_RNN.ipynb)          | 35                 | 26                  | Yes          |
-| [MXNet](notebooks/MXNet_RNN.ipynb)                 | 29                 | 24                  | Yes          |
+| [MXNet(Module API)](notebooks/MXNet_RNN.ipynb)     | 29                 | 24                  | Yes          |
 | [Pytorch](notebooks/PyTorch_RNN.ipynb)             | 31                 | 16                  | Yes          |
 | [Tensorflow](notebooks/Tensorflow_RNN.ipynb)       | 30                 | 22                  | Yes          |
 | [Julia - Knet](notebooks/Knet_RNN.ipynb)           | 29                 | ??                  | Yes          |
@@ -107,9 +112,13 @@ The classification model creates an embedding matrix of size (150x125) and then
 
 ## Lessons Learned
 
-#### CNN
+The points below offer some insights we gained after trying to match test-accuracy across frameworks and from all the GitHub issues/PRs raised.
 
-The below offers some insights I gained after trying to match test-accuracy across frameworks and from all the GitHub issues/PRs raised.
+#### Multi-GPU DenseNet
+
+1. Data loading and augmentation have the potential to flip the results around and, curiously, of all the frameworks Keras seems to be the most efficient at this by default. We will try to create OpenCV-based common data-loading and data-augmentation functions to help standardise the results and let forward+backward training take centre stage
+
+#### CNN
 
 1. The above examples (except for Keras), for ease of comparison, try to use the same level of API and so all use the same generator-function. For [MXNet](support/MXNet_CNN_highAPI.ipynb), [Tensorflow](support/Tensorflow_CNN_highAPI.ipynb), and [CNTK](support/CNTK_CNN_highAPI.ipynb) I have experimented with a higher-level API, where I use the framework's training generator function. The speed improvement is negligible in this example because the whole dataset is loaded as NumPy array in RAM and the only processing done each epoch is a shuffle. I suspect the framework's generators perform the shuffle asynchronously. Curiously, it seems that the frameworks shuffle on a batch-level, rather than on an observation level, and thus ever so slightly decreases the test-accuracy (at least after 10 epochs). For scenarios where we have IO activity and perhaps pre-processing and data-augmentation on the fly, custom generators would have a much bigger impact on performance.
 
@@ -161,6 +170,12 @@ The below offers some insights I gained after trying to match test-accuracy acro
    make install
    ```
 
+13. When using MXNet, avoid assigning outputs or data to a NumPy `np.array` in your training loop, because this causes the data to be copied from the GPU to the CPU. Use `mx.nd.array` instead, allocated in the right context at the beginning. This can dramatically increase performance.
+
+14. When using MXNet, operations are placed on the queue of the back-end engine and parallelized, so try to avoid any blocking operations in your training loop. You can add an `nd.waitall()` at the end of each epoch, which forces a wait for all queued operations to complete and avoids filling up your memory.
+
+15. With MXNet/Gluon, calling `.hybridize()` on your network will cache the computation graph and give you performance gains. However, that means you won't be able to step through every calculation anymore, so use it once you are done debugging your network.
+
 #### RNN
 
 1. There are multiple RNN implementations/kernels available for most frameworks (for example [Tensorflow](http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html)); once reduced down to the cudnnLSTM/GRU level the execution is the fastest, however this implementation is less flexible (e.g. maybe you want layer normalisation) and may become problematic if inference is run on the CPU at a later stage. At the cuDNN level most of the frameworks' runtimes are very similar. [This](https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/) Nvidia blog-post goes through several interesting cuDNN optimisations for recurrent neural nets e.g. fusing - "combining the computation of many small matrices into that of larger ones and streaming the computation whenever possible, the ratio of computation to memory I/O can be increased, which results in better performance on GPU".
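
Regarding the CNN note in the README above about framework generators shuffling at batch level rather than at observation level: the difference can be illustrated with a minimal sketch. Both generator names below are hypothetical and use plain Python lists; they are not the `yield_mb` helper from the notebooks, whose implementation is not part of this diff, and they only assume that a batch-level shuffle keeps the contents of each batch fixed and reorders the batches.

```python
import random


def yield_mb_observation_shuffle(X, y, batchsize, seed=None):
    """Hypothetical helper: shuffle single observations, then cut batches."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)  # every observation can land in any batch
    for start in range(0, len(idx), batchsize):
        chunk = idx[start:start + batchsize]
        yield [X[i] for i in chunk], [y[i] for i in chunk]


def yield_mb_batch_shuffle(X, y, batchsize, seed=None):
    """Hypothetical helper: cut fixed batches first, shuffle only their order."""
    rng = random.Random(seed)
    starts = list(range(0, len(X), batchsize))
    rng.shuffle(starts)  # batch membership never changes between epochs
    for start in starts:
        yield X[start:start + batchsize], y[start:start + batchsize]


# Toy data: 12 observations, 3 batches of 4
X = list(range(12))
y = [v % 3 for v in X]

obs_batches = [xb for xb, _ in yield_mb_observation_shuffle(X, y, 4, seed=0)]
blk_batches = [xb for xb, _ in yield_mb_batch_shuffle(X, y, 4, seed=0)]

print("observation-level:", obs_batches)
print("batch-level:      ", blk_batches)
```

With the batch-level variant each mini-batch always contains the same neighbouring observations, so the gradient steps see less variety from epoch to epoch. That is one plausible reading of the small drop in test-accuracy mentioned in the note, not a measured result.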
 
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# High-level Gluon Example"
+    "# MXNet/Gluon CNN example"
    ]
   },
   {
@@ -71,7 +71,7 @@
    "outputs": [],
    "source": [
     "def SymbolModule(n_classes=N_CLASSES):\n",
-    "    sym = gluon.nn.Sequential()\n",
+    "    sym = gluon.nn.HybridSequential()\n",
     "    with sym.name_scope():\n",
     "        sym.add(gluon.nn.Conv2D(channels=50, kernel_size=3, padding=1, activation='relu'))\n",
     "        sym.add(gluon.nn.Conv2D(channels=50, kernel_size=3, padding=1))\n",
@@ -121,8 +121,8 @@
       "Preparing test set...\n",
       "(50000, 3, 32, 32) (10000, 3, 32, 32) (50000,) (10000,)\n",
       "float32 float32 int32 int32\n",
-      "CPU times: user 630 ms, sys: 588 ms, total: 1.22 s\n",
-      "Wall time: 1.22 s\n"
+      "CPU times: user 708 ms, sys: 589 ms, total: 1.3 s\n",
+      "Wall time: 1.29 s\n"
      ]
     }
    ],
@@ -143,8 +143,8 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "CPU times: user 321 ms, sys: 392 ms, total: 713 ms\n",
-      "Wall time: 876 ms\n"
+      "CPU times: user 345 ms, sys: 421 ms, total: 766 ms\n",
+      "Wall time: 768 ms\n"
      ]
     }
    ],
@@ -164,8 +164,8 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "CPU times: user 203 µs, sys: 128 µs, total: 331 µs\n",
-      "Wall time: 337 µs\n"
+      "CPU times: user 683 µs, sys: 444 µs, total: 1.13 ms\n",
+      "Wall time: 406 µs\n"
      ]
     }
    ],
@@ -178,31 +178,49 @@
    "cell_type": "code",
    "execution_count": 9,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_loss = nd.zeros(1, ctx=ctx)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_loss += nd.ones(1, ctx=ctx)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Epoch   0: loss: 1.8405\n",
-      "Epoch   1: loss: 1.3773\n",
-      "Epoch   2: loss: 1.1577\n",
-      "Epoch   3: loss: 0.9811\n",
-      "Epoch   4: loss: 0.8450\n",
-      "Epoch   5: loss: 0.7354\n",
-      "Epoch   6: loss: 0.6391\n",
-      "Epoch   7: loss: 0.5559\n",
-      "Epoch   8: loss: 0.4810\n",
-      "Epoch   9: loss: 0.4157\n",
-      "CPU times: user 1min 18s, sys: 15.3 s, total: 1min 34s\n",
-      "Wall time: 1min 2s\n"
+      "Epoch   0: loss: 1.8314\n",
+      "Epoch   1: loss: 1.3397\n",
+      "Epoch   2: loss: 1.1221\n",
+      "Epoch   3: loss: 0.9576\n",
+      "Epoch   4: loss: 0.8261\n",
+      "Epoch   5: loss: 0.7215\n",
+      "Epoch   6: loss: 0.6226\n",
+      "Epoch   7: loss: 0.5389\n",
+      "Epoch   8: loss: 0.4729\n",
+      "Epoch   9: loss: 0.4072\n",
+      "CPU times: user 1min 5s, sys: 18 s, total: 1min 23s\n",
+      "Wall time: 56.6 s\n"
      ]
     }
    ],
    "source": [
     "%%time\n",
-    "# Main training loop: 62s\n",
+    "sym.hybridize()\n",
     "for j in range(EPOCHS):\n",
-    "    train_loss = 0.0\n",
+    "    train_loss = nd.zeros(1, ctx=ctx)\n",
     "    for data, target in yield_mb(x_train, y_train, BATCHSIZE, shuffle=True):\n",
     "        # Get samples\n",
     "        data = nd.array(data).as_in_context(ctx)\n",
@@ -215,22 +233,24 @@
     "        # Back-prop\n",
     "        loss.backward()\n",
     "        trainer.step(data.shape[0])\n",
-    "        train_loss += nd.sum(loss).asscalar()\n",
-    "    # Log\n",
-    "    print('Epoch %3d: loss: %5.4f'%(j, train_loss/len(x_train)))"
+    "        train_loss += nd.sum(loss)\n",
+    "    # Log\n",
+    "    # Wait for all queued operations to finish before logging\n",
+    "    nd.waitall()\n",
+    "    print('Epoch %3d: loss: %5.4f'%(j, train_loss.asscalar()/len(x_train)))"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 12,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "CPU times: user 627 ms, sys: 73.1 ms, total: 700 ms\n",
-      "Wall time: 453 ms\n"
+      "CPU times: user 382 ms, sys: 115 ms, total: 496 ms\n",
+      "Wall time: 429 ms\n"
      ]
     }
    ],
@@ -254,14 +274,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 13,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Accuracy:  0.7661258012820513\n"
+      "Accuracy:  0.7675280448717948\n"
      ]
     }
    ],
@@ -273,7 +293,7 @@
  "metadata": {
   "anaconda-cloud": {},
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python [default]",
    "language": "python",
    "name": "python3"
   },