Commit 73e160f

adjust "feature scaling" to be subsection 2.3 in episode 4
1 parent e8a8aad commit 73e160f

File tree

1 file changed: +20 additions, -14 deletions

content/04-supervised-ML-classification.rst

Lines changed: 20 additions & 14 deletions
@@ -46,6 +46,7 @@ Seaborn provides the Penguins dataset through its built-in data-loading function
     import seaborn as sns
 
     penguins = sns.load_dataset('penguins')
+    penguins
 
 
 .. csv-table::
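The hunk above just echoes the loaded frame in a notebook cell. As a self-contained sketch of loading and inspecting the data, the following uses a hypothetical four-row stand-in for the real Penguins dataset (so it runs without downloading anything); the column names match the lesson, the values are illustrative only:

```python
import pandas as pd

# Hypothetical miniature stand-in for sns.load_dataset('penguins'),
# with the same column names but made-up values.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Torgersen'],
    'bill_length_mm': [39.1, 47.5, 49.0, None],
    'body_mass_g': [3750, 5200, 3800, 3700],
    'sex': ['Male', 'Female', 'Female', None],
})

# Evaluating `penguins` as the last expression of a notebook cell prints it;
# in a plain script, use print() instead.
print(penguins.head())
print(penguins.isna().sum())   # count missing values per column
```

Printing `isna().sum()` is a quick way to spot the missing values that the lesson later drops before encoding.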
@@ -76,7 +77,7 @@ There are seven columns include:
 - *body_mass_g*: body mass in grams
 - *sex*: male or female
 
-Looking at numbers from `penguins` `penguins.describe()` usually does not give a very good intuition about the data we are working with, we have the preference to visualize the data.
+Looking at numbers from ``penguins`` and ``penguins.describe()`` usually does not give a very good intuition about the data we are working with, so we prefer to visualize the data.
 
 One nice visualization for datasets with relatively few attributes is the Pair Plot, which can be created using ``sns.pairplot(...)``.
 It shows a scatterplot of each attribute plotted against each of the other attributes.
@@ -145,13 +146,13 @@ Then we apply the same rule to encode the island and sex columns. Although these
 
     encoder = LabelEncoder()
 
-    # encode `species` column with 0=Adelie, 1=Chinstrap, and 2=Gentoo
+    # encode "species" column with 0=Adelie, 1=Chinstrap, and 2=Gentoo
     penguins_classification.loc[:, 'species'] = encoder.fit_transform(penguins_classification['species'])
 
-    # encode `island` column with 0=Biscoe, 1=Dream and 2=Torgersen
+    # encode "island" column with 0=Biscoe, 1=Dream and 2=Torgersen
     penguins_classification.loc[:, 'island'] = encoder.fit_transform(penguins_classification['island'])
 
-    # encode `sex` column 0=Female and 1=Male
+    # encode "sex" column with 0=Female and 1=Male
     penguins_classification.loc[:, 'sex'] = encoder.fit_transform(penguins_classification['sex'])
 
 
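The three `fit_transform` calls in the hunk above can equivalently be written as a loop; here is a runnable sketch with hypothetical rows standing in for `penguins_classification`. Note that `LabelEncoder` assigns 0, 1, 2, ... to the *sorted* unique values, which is why Adelie=0, Chinstrap=1, Gentoo=2:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for the cleaned penguins_classification frame.
penguins_classification = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie'],
    'island': ['Biscoe', 'Dream', 'Torgersen', 'Biscoe'],
    'sex': ['Female', 'Male', 'Female', 'Male'],
})

encoder = LabelEncoder()
for col in ['species', 'island', 'sex']:
    # fit_transform maps sorted unique values to 0, 1, 2, ...
    penguins_classification.loc[:, col] = encoder.fit_transform(penguins_classification[col])

print(penguins_classification)
```

Because the encoder is refit per column, each column gets its own independent 0-based codes (e.g. Female=0, Male=1 for sex).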
@@ -170,7 +171,7 @@ Separating features (X) from labels (y) ensures a clear distinction between what
 .. code-block:: python
 
     X = penguins_classification.drop(['species'], axis=1)
-    y = penguins_classification['species']
+    y = penguins_classification['species'].astype('int')
 
 
 Splitting training and testing sets
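The feature/label separation above, including the ``.astype('int')`` cast the commit adds, can be sketched end-to-end with a hypothetical already-encoded frame (species codes 0=Adelie, 1=Chinstrap, 2=Gentoo, as in the encoding step):

```python
import pandas as pd

# Hypothetical already-encoded frame standing in for penguins_classification.
penguins_classification = pd.DataFrame({
    'species': [0, 1, 2, 0],
    'bill_length_mm': [39.1, 49.0, 47.5, 38.8],
    'body_mass_g': [3750, 3800, 5200, 3700],
})

# Features are every column except the label ...
X = penguins_classification.drop(['species'], axis=1)
# ... and the label is cast to int so scikit-learn treats it as class ids
# rather than an object-dtype column left over from label encoding.
y = penguins_classification['species'].astype('int')

print(X.shape, y.shape)
```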
@@ -183,17 +184,14 @@ This splitting is typically done using the ``train_test_split`` function from ``
 .. code-block:: python
 
     from sklearn.model_selection import train_test_split
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
-    print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")
-
 
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
 
-Training Model & Evaluating Model Performance
----------------------------------------------
+    print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")
 
 
-After preparing the Penguins dataset by handling missing values, encoding categorical variables, and splitting it into features-labels and training-test datasets, the next step is to apply classification algorithms including k-Nearest Neighbors (KNN), Decision Trees, Random Forests, Naive Bayes, and Neural Networks to predict penguin species based on their physical measurements
-Each algorithm offers a unique approach to pattern recognition and generalization, and applying them to the same prepared dataset allows for a fair comparison of their predictive performance.
+Feature scaling
+^^^^^^^^^^^^^^^
 
 Before training, it is also essential to ensure that numerical features are properly scaled via applying standardization or normalization -- especially for distance-based or gradient-based models -- to achieve optimal results.
 
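The split-then-scale sequence from this section can be sketched as follows, using a small hypothetical numeric feature matrix in place of the penguins features. The key point, reflected in the commit's reordering, is that the scaler is fit on the training data only and then applied to both sets, so no test-set information leaks into preprocessing:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features (bill length, body mass) and class labels.
X = np.array([[39.1, 3750], [49.0, 3800], [47.5, 5200], [38.8, 3700],
              [46.2, 5100], [40.3, 3250], [42.0, 4500], [50.0, 5300],
              [37.9, 3175], [45.5, 4850]])
y = np.array([0, 1, 2, 0, 2, 0, 1, 2, 0, 2])

# 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")

# Fit the scaler on the training set only, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After standardization each training-set feature has mean 0 and unit variance; the test set is shifted and scaled by the training statistics, so its mean is close to, but not exactly, zero.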

@@ -208,12 +206,20 @@ Before training, it is also essential to ensure that numerical features are prop
     X_test_scaled = scaler.transform(X_test)
 
 
-Below is the generic steps for training a model:
+
+Training Model & Evaluating Model Performance
+---------------------------------------------
+
+
+After preparing the Penguins dataset by handling missing values, encoding categorical variables, and splitting it into features-labels and training-test datasets, the next step is to apply classification algorithms including k-Nearest Neighbors (KNN), Decision Trees, Random Forests, Naive Bayes, and Neural Networks to predict penguin species based on their physical measurements.
+Each algorithm offers a unique approach to pattern recognition and generalization, and applying them to the same prepared dataset allows for a fair comparison of their predictive performance.
+
+Below are the generic steps for the representative algorithms we will use to train a model for penguin classification:
 
 - choosing a model class and importing that model ``from sklearn.neighbors import XXXClassifier``
 - choosing the model hyperparameters by instantiating this class with desired values ``xxx_clf = XXXClassifier(<... hyperparameters ...>)``
 - training the model to the preprocessed train data by calling the ``fit()`` method of the model instance ``xxx_clf.fit(X_train_scaled, y_train)``
-- making predictions using the trained model on test data ``y_pred_xxx = xxx_clf.predict(X_test)``
+- making predictions using the trained model on test data ``y_pred_xxx = xxx_clf.predict(X_test_scaled)``
 - evaluating model’s performance using available metrics ``score_xxx = accuracy_score(y_test, y_pred_xxx)``
 - (optional) data visualization of confusion matrix and relevant data
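The generic steps above can be sketched concretely with ``KNeighborsClassifier`` standing in for the placeholder ``XXXClassifier``. As a stand-in for the penguins data (so the example is self-contained), this sketch uses scikit-learn's bundled iris dataset; note that prediction uses ``X_test_scaled``, matching the fix in the hunk above:

```python
from sklearn.datasets import load_iris   # stand-in data; the lesson uses penguins
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# scale features: fit on train, transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1-2. import the model class and instantiate it with hyperparameters
knn_clf = KNeighborsClassifier(n_neighbors=5)
# 3. train on the scaled training data
knn_clf.fit(X_train_scaled, y_train)
# 4. predict on the scaled test data (not the raw X_test)
y_pred_knn = knn_clf.predict(X_test_scaled)
# 5. evaluate with an accuracy metric
score_knn = accuracy_score(y_test, y_pred_knn)
print(f"accuracy: {score_knn:.3f}")
```

Swapping ``KNeighborsClassifier`` for a decision tree, random forest, naive Bayes, or neural-network classifier leaves steps 3-5 unchanged, which is what makes the comparison across algorithms fair.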
