the-code-experiments
diff --git a/Preprocessing.ipynb b/Preprocessing.ipynb
Lines changed: 260 additions & 0 deletions
diff --git a/_assets_/deep_net_1.png b/_assets_/deep_net_1.png
1.68 MB
diff --git a/_assets_/linear_plus_nonlinear.png b/_assets_/linear_plus_nonlinear.png
177 KB
@@ -0,0 +1,260 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is the motivation for preprocessing?\n",
+    "\n",
+    "1. Compatibility\n",
+    "\n",
+    "    * Preprocessing makes the data compatible with the library we use. For example, TensorFlow works with `Tensor` objects and not with `Excel` or `csv` files.\n",
+    "    * Data can arrive in any format, so we need to make it compatible with whatever tools we use."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Standardization\n",
+    "\n",
+    "* The process of transforming data into a standard scale.\n",
+    "* This is also known as `Feature Scaling`.\n",
+    "\n",
+    "```\n",
+    "standardized variable = (original variable - mean of original variable) / standard deviation of original variable\n",
+    "```\n",
+    "\n",
+    "Consider an algorithm that has 2 input variables:\n",
+    "\n",
+    "1. Exchange rate\n",
+    "2. Daily trading volume\n",
+    "\n",
+    "And we have 3 days' worth of observations, as below:\n",
+    "\n",
+    "|Day| Exchange rate | Daily trading volume|\n",
+    "|:---|:---|:---|\n",
+    "|1|1.3|110000|\n",
+    "|2|1.34|98700|\n",
+    "|3|1.25|135000|\n",
+    "\n",
+    "Here,\n",
+    "\n",
+    "* The mean of the exchange rate is approximately `1.30`\n",
+    "\n",
+    "* The sample standard deviation of the exchange rate is approximately `0.045`\n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
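+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of the standardization formula above, applied with NumPy to the three observations in the table. The variable names and the `standardize` helper are illustrative and not part of the original notebook, and using the sample standard deviation (`ddof=1`) is an assumption.\n",
+    "\n",
+    "After standardization each variable has mean 0 and standard deviation 1, so the much larger trading volumes no longer dominate the exchange rates."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Observations from the table above (illustrative variable names).\n",
+    "exchange_rate = np.array([1.3, 1.34, 1.25])\n",
+    "trading_volume = np.array([110000, 98700, 135000])\n",
+    "\n",
+    "def standardize(values):\n",
+    "    # (value - mean) / standard deviation.\n",
+    "    # ddof=1 selects the sample standard deviation (an assumed choice).\n",
+    "    return (values - values.mean()) / values.std(ddof=1)\n",
+    "\n",
+    "print(standardize(exchange_rate))\n",
+    "print(standardize(trading_volume))"
+   ]
+  },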
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## One-hot encoding\n",
+    "\n",
+    "* One-hot encoding is an encoding technique that transforms data into a numerical form which a model can understand.\n",
+    "\n",
+    "* This technique is applied to categorical data when there are only a few categories.\n",
+    "\n",
+    "### Categorical data\n",
+    "\n",
+    "* Categorical data are variables that contain label values rather than numeric values.\n",
+    "* Categorical variables whose values have no natural order are also called `Nominal`.\n",
+    "\n",
+    "For example:\n",
+    "\n",
+    "1. A \"pet\" variable with the values \"dog\", \"cat\" etc.\n",
+    "2. A \"color\" variable with the values \"red\", \"green\" and \"blue\".\n",
+    "\n",
+    "**Notes**\n",
+    "\n",
+    "* Some algorithms can work with categorical data directly. For example, a decision tree can be learned directly from categorical data with no data transformation.\n",
+    "\n",
+    "* Many algorithms cannot operate on label data directly; they require all input and output variables to be in numeric form. Thus, encoding is required.\n",
+    "\n",
+    "### How to transform categorical data to numerical data?\n",
+    "\n",
+    "There are 2 steps involved:\n",
+    "\n",
+    "1. Label/Integer encoding\n",
+    "2. One-hot encoding\n",
+    "\n",
+    "#### Integer encoding\n",
+    "\n",
+    "* Each unique category value is assigned an integer value.\n",
+    "\n",
+    "For example:\n",
+    "\n",
+    "|Food name|Integer code|Calories|\n",
+    "|:---|:---|:---|\n",
+    "|Apple|1|95|\n",
+    "|Orange|2|100|\n",
+    "|Broccoli|3|50|\n",
+    "\n",
+    "* There is a problem with the above encoding:\n",
+    "   \n",
+    "   1. Integers have a natural order, so the model may assume an ordered relationship between categories that does not exist. For example, if your model internally needs to calculate an average across categories, it might compute `(1 + 3) / 2 = 2`. This means that, according to your model, the average of Apple and Broccoli is Orange.\n",
+    "\n",
+    "#### One-hot encoding\n",
+    "\n",
+    "* For categorical variables where no ordinal relationship exists, integer encoding is not enough.\n",
+    "\n",
+    "* In fact, using integer encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results.\n",
+    "\n",
+    "* In this case, a one-hot encoding can be applied to the integer representation.\n",
+    "\n",
+    "For example:\n",
+    "\n",
+    "|Apple|Orange|Broccoli|Calories|\n",
+    "|:---|:---|:---|:---|\n",
+    "|1|0|0|95|\n",
+    "|0|1|0|100|\n",
+    "|0|0|1|50|"
+   ]
+  },
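+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a rough sketch of how the one-hot table above can be built from integer codes, the cell below indexes an identity matrix with NumPy. The `foods` and `codes` names are illustrative, and the codes are shifted to start at 0 (rather than 1 as in the integer-encoding table) so they can be used as row indices."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "# Categories and their zero-based integer codes (illustrative names).\n",
+    "foods = [\"Apple\", \"Orange\", \"Broccoli\"]\n",
+    "codes = np.array([0, 1, 2])\n",
+    "\n",
+    "# Row i of the identity matrix is the one-hot vector for code i,\n",
+    "# so indexing the identity matrix by the codes yields one row per sample.\n",
+    "one_hot = np.eye(len(foods), dtype=int)[codes]\n",
+    "print(one_hot)"
+   ]
+  },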
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### One-hot encoding using TensorFlow 2.0.0/Keras\n",
+    "\n",
+    "The `tf.one_hot` function in TensorFlow converts a set of sparse (integer) labels to a dense one-hot representation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 67,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Tensor(\"one_hot_23:0\", shape=(3, 3), dtype=float32)\n",
+      "[[1. 0. 0.]\n",
+      " [0. 1. 0.]\n",
+      " [0. 0. 1.]]\n"
+     ]
+    }
+   ],
+   "source": [
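+    "import tensorflow.compat.v1 as tf\n",
+    "\n",
+    "# Use graph mode so that tf.Session works when TensorFlow 2.x is installed.\n",
+    "tf.disable_eager_execution()\n",
+    "\n",
+    "# Build a graph op that one-hot encodes the class indices 0, 1 and 2.\n",
+    "output = tf.one_hot(indices=[0, 1, 2], depth=3)\n",
+    "print(output)  # symbolic Tensor, not yet evaluated\n",
+    "\n",
+    "# Evaluate the op inside a session to get the actual values.\n",
+    "with tf.Session() as sess:\n",
+    "    result = sess.run(output)\n",
+    "print(result)"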
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### One-hot encoding using scikit-learn"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 68,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['Apple' 'Orange' 'Broccoli' 'Apple' 'Grape']\n",
+      "[0 3 1 0 2]\n",
+      "[[1. 0. 0. 0.]\n",
+      " [0. 0. 0. 1.]\n",
+      " [0. 1. 0. 0.]\n",
+      " [1. 0. 0. 0.]\n",
+      " [0. 0. 1. 0.]]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.\n",
+      "If you want the future behaviour and silence this warning, you can specify \"categories='auto'\".\n",
+      "In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.\n",
+      "  warnings.warn(msg, FutureWarning)\n"
+     ]
+    }
+   ],
+   "source": [
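+    "import numpy as np\n",
+    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
+    "\n",
+    "data = [\"Apple\", \"Orange\", \"Broccoli\", \"Apple\", \"Grape\"]\n",
+    "\n",
+    "# Show the raw labels as a NumPy array.\n",
+    "docs1 = np.array(data)\n",
+    "print(docs1)\n",
+    "\n",
+    "# Step 1: integer encoding. LabelEncoder numbers the classes in sorted order.\n",
+    "label_encoder = LabelEncoder()\n",
+    "integer_encoded = label_encoder.fit_transform(docs1)\n",
+    "print(integer_encoded)\n",
+    "\n",
+    "# Step 2: one-hot encoding. OneHotEncoder expects a 2-D array of shape (n_samples, 1).\n",
+    "# Note: scikit-learn 1.2+ renames the `sparse` parameter to `sparse_output`.\n",
+    "onehot_encoder = OneHotEncoder(sparse=False)\n",
+    "integer_encoded = integer_encoded.reshape(-1, 1)\n",
+    "onehot_encoded = onehot_encoder.fit_transform(integer_encoded)\n",
+    "print(onehot_encoded)"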
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References\n",
+    "\n",
+    "* [Nominal Category](https://en.wikipedia.org/wiki/Nominal_category)\n",
+    "\n",
+    "* [Categorical Variable](https://en.wikipedia.org/wiki/Categorical_variable)\n",
+    "\n",
+    "* [One-hot Encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)\n",
+    "\n",
+    "* [One-hot Tensor](https://www.tensorflow.org/api_docs/python/tf/one_hot)\n",
+    "\n",
+    "* [tensorflow.one_hot Examples](https://www.programcreek.com/python/example/90553/tensorflow.one_hot)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}