|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Example : Optimized Tensorflow workflow\n", |
| 8 | + "\n", |
| 9 | + "## Summary\n", |
| 10 | + "This example is optimized tensorflow training workflow.\n", |
| 11 | + "\n", |
| 12 | + "On this example, we will train hand-writing number classification model with [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database).\n", |
| 13 | + "\n", |
| 14 | + "\n", |
| 15 | + "We will a lots of skills for solve problem of common training workflow\n", |
| 16 | + "\n", |
| 17 | + "#### Problem of unoptimized workflow\n", |
| 18 | + "* Use too much VRAM(Even it really doesn't need)\n", |
| 19 | + "* Slow Training Speed\n", |
| 20 | + "\n", |
| 21 | + "- - -\n", |
| 22 | + "### Import pacakges" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": null, |
| 28 | + "metadata": {}, |
| 29 | + "outputs": [], |
| 30 | + "source": [ |
| 31 | + "import tensorflow as tf\n", |
| 32 | + "from nvidia.dali import pipeline_def, fn, types\n", |
| 33 | + "import nvidia.dali.plugin.tf as dali_tf\n", |
| 34 | + "\n", |
| 35 | + "import os\n", |
| 36 | + "import glob\n", |
| 37 | + "import math\n" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "markdown", |
| 42 | + "metadata": {}, |
| 43 | + "source": [ |
| 44 | + "#### What those packages do?\n", |
| 45 | + "* [TensorFlow](https://www.tensorflow.org/) : Define and training model.\n", |
| 46 | + "* [nvidia.dali](https://developer.nvidia.com/dali/) : Preprocess and load data with GPU-acceleration.\n", |
| 47 | + "* [os](https://docs.python.org/3/library/os.html) : Get label and join splited path to one.\n", |
| 48 | + "* [glob](https://docs.python.org/3/library/glob.html) : Get all image files absolute path.\n", |
| 49 | + "* [math](https://docs.python.org/3/library/math.html) : Compute iteration per epoch with ceil.\n", |
| 50 | + "---\n", |
| 51 | + "## Optimizing method\n", |
| 52 | + "* GPU Accelerated Dataloader - [Nvidia DALI](https://developer.nvidia.com/dali/)\n", |
| 53 | + " * Reduce RAM - CPU - GPU Memory bottleneck with [GPU Direct Storage](https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html)\n", |
| 54 | + " * Data augmentation with GPU Acceleration\n", |
| 55 | + "* Fast Forward/Backward Computation - [Mixed Precision Training](https://arxiv.org/abs/1710.03740)\n", |
| 56 | + " * Effective [MMA (Matrix Multiply-accumulate)](https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation) Computation on Nvidia Ampere GPU\n", |
| 57 | + "* Optimized GPU job scheduler - [XLA](https://www.tensorflow.org/xla)\n", |
| 58 | + " * Optimize [SM (Stream Multiprocessor)](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf#page=22) interal job scheduling\n", |
| 59 | + "* Change TensorFlow GPU memory strategy\n", |
| 60 | + " * Reduce GPU memory consumption of TensorFlow process\n", |
| 61 | + "\n", |
| 62 | + "### Set TensorFlow runtime setting\n", |
| 63 | + "To enable mixed precision training and change GPU memory strategy, this code block need to be run." |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": null, |
| 69 | + "metadata": {}, |
| 70 | + "outputs": [], |
| 71 | + "source": [ |
| 72 | + "gpu_ids = [0]\n", |
| 73 | + "# Replace 0 with device id what you will use.\n", |
| 74 | + "\n", |
| 75 | + "# Get available GPUs\n", |
| 76 | + "gpus = tf.config.list_physical_devices('GPU')\n", |
| 77 | + "target_gpus = [gpus[gpu_id] for gpu_id in gpu_ids]\n", |
| 78 | + "\n", |
| 79 | + "# Set tensorflow can use all selected GPUs\n", |
| 80 | + "tf.config.set_visible_devices(target_gpus, 'GPU')\n", |
| 81 | + "\n", |
| 82 | + "#Memory strategy change : allocate as much as possible -> allocate as need\n", |
| 83 | + "for target_gpu in target_gpus:\n", |
| 84 | + " tf.config.experimental.set_memory_growth(target_gpu, True)\n", |
| 85 | + "\n", |
| 86 | + "# Make TensorFlow use mixed precision training\n", |
| 87 | + "tf.keras.mixed_precision.set_global_policy('mixed_float16')" |
| 88 | + ] |
| 89 | + }, |
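| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As an optional sanity check (a minimal sketch), you can confirm that the policy and device settings took effect:\n", |
| | + "```python\n", |
| | + "policy = tf.keras.mixed_precision.global_policy()\n", |
| | + "print(policy.compute_dtype, policy.variable_dtype)  # expect: float16 float32\n", |
| | + "print(tf.config.get_visible_devices('GPU'))  # only the GPUs selected above\n", |
| | + "```" |
| | + ] |
| | + }, |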
| 90 | + { |
| 91 | + "cell_type": "markdown", |
| 92 | + "metadata": {}, |
| 93 | + "source": [ |
| 94 | + "### Define dataset and dataloader\n", |
| 95 | + "We will assume dataset is infinite or It can only stored partial dataset in [RAM](https://en.wikipedia.org/wiki/Random-access_memory).\\\n", |
| 96 | + "So we will use `DALI` to load every decoded data to GPU Memory with [DMA(Direct Memory Access)](https://en.wikipedia.org/wiki/Direct_memory_access) and augment it.\n", |
| 97 | + "\n", |
| 98 | + "" |
| 99 | + ] |
| 100 | + }, |
| 101 | + { |
| 102 | + "cell_type": "code", |
| 103 | + "execution_count": null, |
| 104 | + "metadata": {}, |
| 105 | + "outputs": [], |
| 106 | + "source": [ |
| 107 | + "# Define batch size and images path for dataloader\n", |
| 108 | + "batch_size = 2560\n", |
| 109 | + "image_dir = r'./mnist_png/training/'\n", |
| 110 | + "\n", |
| 111 | + "# Define dali image pipeline\n", |
| 112 | + "@pipeline_def(batch_size=batch_size)\n", |
| 113 | + "def mnist_pipeline(image_dir):\n", |
| 114 | + " images, labels = fn.readers.file(file_root=image_dir)\n", |
| 115 | + " images = fn.decoders.image(images, device='mixed', output_type=types.GRAY)\n", |
| 116 | + " images = fn.crop_mirror_normalize(images, device=\"gpu\", dtype=types.FLOAT, std=[255.], output_layout=\"CHW\")\n", |
| 117 | + " labels = labels.gpu()\n", |
| 118 | + " return (images, labels)\n", |
| 119 | + "\n", |
| 120 | + "# Define shapes and dtypes for dali tensorflow dataloader\n", |
| 121 | + "shapes = (\n", |
| 122 | + " (batch_size, 1, 28, 28),\n", |
| 123 | + " (batch_size))\n", |
| 124 | + "dtypes = (\n", |
| 125 | + " tf.float32,\n", |
| 126 | + " tf.int32)\n", |
| 127 | + "\n", |
| 128 | + "\n", |
| 129 | + "dataloader = dali_tf.DALIDataset(\n", |
| 130 | + " pipeline=mnist_pipeline(image_dir),\n", |
| 131 | + " batch_size=batch_size,\n", |
| 132 | + " output_shapes=shapes,\n", |
| 133 | + " output_dtypes=dtypes,\n", |
| 134 | + " device_id=gpu_ids[0]\n", |
| 135 | + ")" |
| 136 | + ] |
| 137 | + }, |
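| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Since `DALIDataset` behaves like a regular `tf.data.Dataset`, you can pull a single batch to verify shapes and dtypes before training (a minimal sketch; the shapes shown assume the `batch_size` above):\n", |
| | + "```python\n", |
| | + "for images, labels in dataloader.take(1):\n", |
| | + "    print(images.shape, images.dtype)  # (2560, 1, 28, 28), float32\n", |
| | + "    print(labels.shape, labels.dtype)  # (2560,), int32\n", |
| | + "```" |
| | + ] |
| | + }, |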
| 138 | + { |
| 139 | + "cell_type": "markdown", |
| 140 | + "metadata": {}, |
| 141 | + "source": [ |
| 142 | + "#### How DALI works\n", |
| 143 | + "\n", |
| 144 | + "\n", |
| 145 | + "On this dataloader, `DALI` will load batch with this process\n", |
| 146 | + "1. Decode image file on CPU to transform it to array.\n", |
| 147 | + "2. Directly send raw array to GPU.\n", |
| 148 | + "3. Preprocess data with GPU-Acceleration.\n", |
| 149 | + "4. Prefetch next batch and load batch it to training process when batch end.\n", |
| 150 | + "\n", |
| 151 | + "Batch will prefetched like image below.\n", |
| 152 | + "\n", |
| 153 | + "\n", |
| 154 | + "\n", |
| 155 | + "dataflow will be like picture below:\n", |
| 156 | + "\n", |
| 157 | + "<img src=\"./imgs/gpudirect_storage.png\" width='425px' height='450px'>\n", |
| 158 | + "\n", |
| 159 | + "---\n", |
| 160 | + "\n", |
| 161 | + "### Define model, optimizer, loss function\n", |
| 162 | + "\n", |
| 163 | + "This example Task is 'Multi labels classification'. so model would like below.\n", |
| 164 | + "\n", |
| 165 | + "* Model is simple model Based on [Convolutional Layers](https://arxiv.org/abs/1511.08458).\n", |
| 166 | + "* Loss function will be [sparse categorical crossentropy](https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy).\n", |
| 167 | + "* Optimizer will be [AdamW](https://arxiv.org/abs/1711.05101).\n", |
| 168 | + "\n", |
| 169 | + "For convenience, model's performance would be only measured by train set accuracy.\n", |
| 170 | + "\n", |
| 171 | + "#### Model architecture\n", |
| 172 | + "<img src=\"./imgs/model_architecture.png\" width=\"300px\" height=\"500px\">" |
| 173 | + ] |
| 174 | + }, |
| 175 | + { |
| 176 | + "cell_type": "code", |
| 177 | + "execution_count": null, |
| 178 | + "metadata": {}, |
| 179 | + "outputs": [], |
| 180 | + "source": [ |
| 181 | + "model = tf.keras.Sequential([\n", |
| 182 | + " tf.keras.layers.Conv2D(32, kernel_size=(3,3), input_shape=(1, 28, 28), activation='relu', data_format='channels_first'),\n", |
| 183 | + " tf.keras.layers.Conv2D(64, kernel_size=(3,3), activation='relu'),\n", |
| 184 | + " tf.keras.layers.MaxPool2D(pool_size=(2,2)),\n", |
| 185 | + " tf.keras.layers.Flatten(),\n", |
| 186 | + " tf.keras.layers.Dense(512, activation='relu'),\n", |
| 187 | + " tf.keras.layers.Dense(128)\n", |
| 188 | + "])\n", |
| 189 | + "\n", |
| 190 | + "optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001)\n", |
| 191 | + "loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n", |
| 192 | + "metrics = ['accuracy']" |
| 193 | + ] |
| 194 | + }, |
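| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "To sanity-check the layer output shapes against the architecture diagram above, you can print a model summary (an optional sketch):\n", |
| | + "```python\n", |
| | + "model.summary()  # the model is already built because the first layer declares input_shape\n", |
| | + "```" |
| | + ] |
| | + }, |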
| 195 | + { |
| 196 | + "cell_type": "markdown", |
| 197 | + "metadata": {}, |
| 198 | + "source": [ |
| 199 | + "### Compile model and start training\n", |
| 200 | + "\n", |
| 201 | + "For use XLA, `jit_compile` flag must be `True` on model compile.\n", |
| 202 | + "```\n", |
| 203 | + "model.compile(..., jit_compile=True)\n", |
| 204 | + "```\n", |
| 205 | + "\n", |
| 206 | + "\n", |
| 207 | + "Current Training Environment is like below:\n", |
| 208 | + "\n", |
| 209 | + "|Precision|Batch preprocssing|Batch caching|GPU select|GPU memory strategy|\n", |
| 210 | + "|---|---|---|---|---|\n", |
| 211 | + "|FP16|Inline<br>Compute by GPU|Prefetch on batch demands<br>Stored in GPU memory|Selectable By user|Grow up when need|\n" |
| 212 | + ] |
| 213 | + }, |
| 214 | + { |
| 215 | + "cell_type": "code", |
| 216 | + "execution_count": null, |
| 217 | + "metadata": {}, |
| 218 | + "outputs": [], |
| 219 | + "source": [ |
| 220 | + "# Define train epochs\n", |
| 221 | + "epochs = 100\n", |
| 222 | + "\n", |
| 223 | + "# Compile model with XLA\n", |
| 224 | + "model.compile(optimizer=optimizer,\n", |
| 225 | + " loss=loss_fn,\n", |
| 226 | + " metrics=metrics,\n", |
| 227 | + "\t\t\tjit_compile=True)\n", |
| 228 | + "\n", |
| 229 | + "# Compute interation per epoch for 'large but limited size dataset'\n", |
| 230 | + "# If dataset's size is infinite, set how many step to do on 1 epoch\n", |
| 231 | + "iteration_per_epoch = math.ceil(len(glob.glob(os.path.join(image_dir, '*/*.png')))/batch_size)\n", |
| 232 | + "\n", |
| 233 | + "model.fit(dataloader, epochs=epochs, steps_per_epoch=iteration_per_epoch)" |
| 234 | + ] |
| 235 | + }, |
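| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Note that XLA is not limited to `model.compile`; any `tf.function` can be JIT-compiled the same way (a small illustrative sketch, unrelated to the model above):\n", |
| | + "```python\n", |
| | + "@tf.function(jit_compile=True)\n", |
| | + "def scaled_relu(x):\n", |
| | + "    # XLA fuses these ops into a single kernel\n", |
| | + "    return tf.nn.relu(x) * 2.0\n", |
| | + "\n", |
| | + "scaled_relu(tf.random.normal((4, 4)))  # first call triggers XLA compilation\n", |
| | + "```" |
| | + ] |
| | + }, |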
| 236 | + { |
| 237 | + "cell_type": "markdown", |
| 238 | + "metadata": {}, |
| 239 | + "source": [ |
| 240 | + "### After Training\n", |
| 241 | + "TensorFlow have [critical bug](https://github.com/tensorflow/tensorflow/issues/1727#issuecomment-225665915) that **won't release GPU memory** after model used(both Training, Evaluation).\\\n", |
| 242 | + "So we need to free GPU memory for other users.\n", |
| 243 | + "\n", |
| 244 | + "#### Step\n", |
| 245 | + "1. [Save trained model](https://www.tensorflow.org/guide/keras/save_and_serialize)\n", |
| 246 | + "2. Kill Tensorflow Process" |
| 247 | + ] |
| 248 | + }, |
| 249 | + { |
| 250 | + "cell_type": "code", |
| 251 | + "execution_count": null, |
| 252 | + "metadata": {}, |
| 253 | + "outputs": [], |
| 254 | + "source": [ |
| 255 | + "model_save_path = r'./latest.h5'\n", |
| 256 | + "\n", |
| 257 | + "# save model to file\n", |
| 258 | + "model.save(model_save_path)\n", |
| 259 | + "exit(0)\n" |
| 260 | + ] |
| 261 | + }, |
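| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "In a fresh process, the saved model can then be restored for evaluation or inference (a minimal sketch; the path matches `model_save_path` above):\n", |
| | + "```python\n", |
| | + "import tensorflow as tf\n", |
| | + "\n", |
| | + "restored_model = tf.keras.models.load_model('./latest.h5')\n", |
| | + "restored_model.summary()\n", |
| | + "```" |
| | + ] |
| | + }, |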
| 262 | + { |
| 263 | + "cell_type": "markdown", |
| 264 | + "metadata": {}, |
| 265 | + "source": [ |
| 266 | + "### Compare optimization Before & After \n", |
| 267 | + "\n", |
| 268 | + "\n", |
| 269 | + "||Before|After|\n", |
| 270 | + "|---|---|---\n", |
| 271 | + "|**Precision**|TF32|FP16|\n", |
| 272 | + "|**Dataloader**|TensorFlow|Nvidia DALI|\n", |
| 273 | + "|**Batch caching**|Next batch only<br>RAM|Auto-Adjusted by DALI<br>GPU memory|\n", |
| 274 | + "|**Batch preprocessing**|OpenCV/Numpy<br>CPU|DALI<br>GPU|\n", |
| 275 | + "|**GPU Usage**|Training|Training<br>Preprocessing|\n", |
| 276 | + "|**GPU Select**|Automatically Selected by TensorFlow|Selectable By user|\n", |
| 277 | + "|**GPU memory strategy**|As much as Possible<br>([Automatically Selected by TensorFlow]((https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth)))|Grow up when need|" |
| 278 | + ] |
| 279 | + } |
| 280 | + ], |
| 281 | + "metadata": { |
| 282 | + "kernelspec": { |
| 283 | + "display_name": "tensorflow", |
| 284 | + "language": "python", |
| 285 | + "name": "python3" |
| 286 | + }, |
| 287 | + "language_info": { |
| 288 | + "codemirror_mode": { |
| 289 | + "name": "ipython", |
| 290 | + "version": 3 |
| 291 | + }, |
| 292 | + "file_extension": ".py", |
| 293 | + "mimetype": "text/x-python", |
| 294 | + "name": "python", |
| 295 | + "nbconvert_exporter": "python", |
| 296 | + "pygments_lexer": "ipython3", |
| 297 | + "version": "3.10.11" |
| 298 | + }, |
| 299 | + "orig_nbformat": 4 |
| 300 | + }, |
| 301 | + "nbformat": 4, |
| 302 | + "nbformat_minor": 2 |
| 303 | +} |