
Commit 90a4b1c

Author: Ashwin Hegde (committed)

feat(docs): preprocessing

1 parent 21c8731 · commit 90a4b1c

File tree

3 files changed: +260 -0 lines changed


Diff for: Preprocessing.ipynb

@@ -0,0 +1,260 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the motivation for preprocessing?\n",
"\n",
"1. Compatibility\n",
"\n",
"    * It enables compatibility with the library we use. For example, TensorFlow works with `Tensor` objects, not with `Excel` or `csv` files.\n",
"    * Data can arrive in any format; we need to make it compatible with whatever tools we use."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardization\n",
"\n",
"* The process of transforming data onto a standard scale.\n",
"* This is also known as `Feature Scaling`.\n",
"\n",
"```\n",
"standardized variable = (original variable - mean of original variable) / standard deviation of original variable\n",
"```\n",
"\n",
"Consider an algorithm that has 2 input variables:\n",
"\n",
"1. Exchange rate\n",
"2. Daily trading volume\n",
"\n",
"And we have 3 days' worth of observations, as below:\n",
"\n",
"|Day| Exchange rate | Daily trading volume|\n",
"|:---|:---|:---|\n",
"|1|1.3|110000|\n",
"|2|1.34|98700|\n",
"|3|1.25|135000|\n",
"\n",
"Here,\n",
"\n",
"* The mean of the exchange rate is approximately `1.30`\n",
"\n",
"* The sample standard deviation is approximately `0.045`"
]
},
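{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal NumPy sketch that verifies the numbers above by applying the standardization formula to the exchange-rate column; the variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Exchange-rate observations from the table above\n",
"exchange_rate = np.array([1.3, 1.34, 1.25])\n",
"\n",
"mean = exchange_rate.mean()       # ~1.2967, i.e. roughly 1.30\n",
"std = exchange_rate.std(ddof=1)   # sample standard deviation, ~0.045\n",
"\n",
"# standardized variable = (original variable - mean) / standard deviation\n",
"standardized = (exchange_rate - mean) / std\n",
"print(mean, std)\n",
"print(standardized)"
]
},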
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One-hot encoding\n",
"\n",
"* One-hot encoding is an encoding technique that transforms data into a numerical form the model can understand.\n",
"\n",
"* This technique is applied to categorical data when dealing with a small number of categories.\n",
"\n",
"### Categorical data\n",
"\n",
"* Categorical data are variables that contain label values rather than numeric values.\n",
"* Categorical variables are also called `Nominal` variables.\n",
"\n",
"For example:\n",
"\n",
"1. A \"pet\" variable with the values \"dog\", \"cat\" etc.\n",
"2. A \"color\" variable with the values \"red\", \"green\" and \"blue\".\n",
"\n",
"**Notes**\n",
"\n",
"* Some algorithms can work with categorical data directly; e.g. a decision tree can be learned directly from categorical data with no data transformation.\n",
"\n",
"* Many algorithms cannot operate on label data directly; they require all input and output variables to be in numeric form. Thus, encoding is required.\n",
"\n",
"### How to transform categorical data to numerical data?\n",
"\n",
"There are 2 steps involved:\n",
"\n",
"1. Label/Integer encoding\n",
"2. One-hot encoding\n",
"\n",
"#### Integer encoding\n",
"\n",
"* Each unique category value is assigned an integer value.\n",
"\n",
"For example:\n",
"\n",
"|Food name|Categorical #|Calories|\n",
"|:---|:---|:---|\n",
"|Apple|1|95|\n",
"|Orange|2|100|\n",
"|Broccoli|3|50|\n",
"\n",
"* There are a few problems with the above encoding:\n",
"\n",
"    1. The integer values have a natural ordered relationship between each other. Now, if your model internally needs to calculate an average across categories, it might do `(1 + 3) / 2 = 2`. This means that, according to your model, the average of Apple and Broccoli is Orange.\n",
"\n",
"#### One-hot encoding\n",
"\n",
"* For categorical variables where no ordinal relationship exists, integer encoding is not enough.\n",
"\n",
"* In fact, using integer encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results.\n",
"\n",
"* In this case, a one-hot encoding can be applied to the integer representation.\n",
"\n",
"For example:\n",
"\n",
"|Apple|Orange|Broccoli|Calories|\n",
"|:---|:---|:---|:---|\n",
"|1|0|0|95|\n",
"|0|1|0|100|\n",
"|0|0|1|50|"
]
},
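{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the tables above, `pandas.get_dummies` can produce the same one-hot columns in a single call. The DataFrame below simply reproduces the example food table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# The example food table as a DataFrame\n",
"df = pd.DataFrame({\n",
"    \"Food name\": [\"Apple\", \"Orange\", \"Broccoli\"],\n",
"    \"Calories\": [95, 100, 50]\n",
"})\n",
"\n",
"# One-hot encode the categorical column; each category becomes its own 0/1 column\n",
"one_hot = pd.get_dummies(df, columns=[\"Food name\"])\n",
"print(one_hot)"
]
},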
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One-hot encoding using TensorFlow 2.0.0/Keras\n",
"\n",
"TensorFlow provides a `one_hot` function that converts a set of sparse labels to a dense one-hot representation."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(\"one_hot_23:0\", shape=(3, 3), dtype=float32)\n",
"[[1. 0. 0.]\n",
" [0. 1. 0.]\n",
" [0. 0. 1.]]\n"
]
}
],
"source": [
"import tensorflow.compat.v1 as tf\n",
"\n",
"# Graph mode is required for tf.Session when running on TensorFlow 2.x\n",
"tf.disable_eager_execution()\n",
"\n",
"# Build a (3, 3) one-hot tensor: row i has a 1 at column indices[i]\n",
"output = tf.one_hot(indices=[0, 1, 2], depth=3)\n",
"print(output)\n",
"\n",
"# Evaluate the graph to get the dense NumPy array\n",
"with tf.Session() as sess:\n",
"    result = sess.run(output)\n",
"print(result)"
]
},
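{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under TensorFlow 2.x's default eager execution, the same `tf.one_hot` call works directly, without a `Session`. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"# Eager execution is the default in TF 2.x, so the result is available immediately\n",
"output = tf.one_hot(indices=[0, 1, 2], depth=3)\n",
"print(output)          # tf.Tensor of shape (3, 3), dtype float32\n",
"print(output.numpy())  # plain NumPy array"
]
},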
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One-hot encoding using scikit-learn\n",
"\n",
"scikit-learn's `LabelEncoder` turns string categories into integers, and `OneHotEncoder` turns those integers into one-hot vectors."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Apple' 'Orange' 'Broccoli' 'Apple' 'Grape']\n",
"[0 3 1 0 2]\n",
"[[1. 0. 0. 0.]\n",
" [0. 0. 0. 1.]\n",
" [0. 1. 0. 0.]\n",
" [1. 0. 0. 0.]\n",
" [0. 0. 1. 0.]]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.\n",
"If you want the future behaviour and silence this warning, you can specify \"categories='auto'\".\n",
"In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.\n",
" warnings.warn(msg, FutureWarning)\n"
]
}
],
"source": [
"from numpy import array\n",
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"\n",
"data = [\"Apple\", \"Orange\", \"Broccoli\", \"Apple\", \"Grape\"]\n",
"\n",
"docs1 = array(data)\n",
"print(docs1)\n",
"\n",
"# Step 1: label/integer encoding (categories are assigned integers alphabetically)\n",
"label_encoding = LabelEncoder()\n",
"integer_encoded = label_encoding.fit_transform(data)\n",
"print(integer_encoded)\n",
"\n",
"# Step 2: one-hot encoding of the integer labels\n",
"onehot_encoder = OneHotEncoder(sparse=False)\n",
"integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)\n",
"onehot_encoded = onehot_encoder.fit_transform(integer_encoded)\n",
"print(onehot_encoded)"
]
},
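{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keras also ships a small utility, `tf.keras.utils.to_categorical`, that converts integer labels straight into one-hot vectors. A minimal sketch, reusing the integer labels produced above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.utils import to_categorical\n",
"\n",
"# Integer labels from the label-encoding step above\n",
"integer_labels = [0, 3, 1, 0, 2]\n",
"\n",
"# num_classes=4 because there are 4 distinct categories\n",
"one_hot = to_categorical(integer_labels, num_classes=4)\n",
"print(one_hot)"
]
},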
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"* [Nominal Category](https://en.wikipedia.org/wiki/Nominal_category)\n",
"\n",
"* [Categorical Variable](https://en.wikipedia.org/wiki/Categorical_variable)\n",
"\n",
"* [One-hot Encoding](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)\n",
"\n",
"* [One-hot Tensor](https://www.tensorflow.org/api_docs/python/tf/one_hot)\n",
"\n",
"* [tensorflow.one_hot examples](https://www.programcreek.com/python/example/90553/tensorflow.one_hot)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Diff for: _assets_/deep_net_1.png

1.68 MB

Diff for: _assets_/linear_plus_nonlinear.png

177 KB
