content/04-supervised-ML-classification.rst
@@ -46,6 +46,7 @@ Seaborn provides the Penguins dataset through its built-in data-loading function

.. code-block:: python

   import seaborn as sns

   penguins = sns.load_dataset('penguins')
   penguins

.. csv-table::
@@ -76,7 +77,7 @@ There are seven columns, including:

- *body_mass_g*: body mass in grams
- *sex*: male or female

Looking at the raw numbers from ``penguins`` and ``penguins.describe()`` usually does not give a good intuition about the data we are working with, so we prefer to visualize the data.

One nice visualization for datasets with relatively few attributes is the Pair Plot, which can be created using ``sns.pairplot(...)``.
It shows a scatterplot of each attribute plotted against each of the other attributes.
@@ -145,13 +146,13 @@ Then we apply the same rule to encode the island and sex columns. Although these

   encoder = LabelEncoder()

   # encode "species" column with 0=Adelie, 1=Chinstrap, and 2=Gentoo
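A minimal sketch of this encoding step, run on a tiny hypothetical table rather than the full dataset (note that ``LabelEncoder`` assigns integer codes in alphabetical order of the labels):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# tiny stand-in for the cleaned penguins table (hypothetical values)
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'sex': ['Male', 'Female', 'Female', 'Male'],
})

encoder = LabelEncoder()
for col in ['species', 'island', 'sex']:
    # fit_transform learns the alphabetical label order, then maps each value
    df[col] = encoder.fit_transform(df[col])

print(df['species'].tolist())  # -> [0, 2, 1, 0]: Adelie=0, Chinstrap=1, Gentoo=2
```

Reusing one ``encoder`` per column is fine here because ``fit_transform`` refits it each time; keep one fitted encoder per column if you later need ``inverse_transform``.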
@@ -170,7 +171,7 @@ Separating features (X) from labels (y) ensures a clear distinction between what

.. code-block:: python

   X = penguins_classification.drop(['species'], axis=1)
   y = penguins_classification['species'].astype('int')
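Assuming ``penguins_classification`` holds the encoded table, the feature/label separation can be sketched on a small hypothetical frame:

```python
import pandas as pd

# hypothetical stand-in for the encoded penguins_classification table
penguins_classification = pd.DataFrame({
    'bill_length_mm': [39.1, 46.5, 49.3],
    'body_mass_g': [3750, 5200, 3775],
    'species': [0.0, 2.0, 1.0],  # may be float after earlier processing, hence astype('int')
})

X = penguins_classification.drop(['species'], axis=1)   # features only
y = penguins_classification['species'].astype('int')    # integer class labels

print(list(X.columns))  # -> ['bill_length_mm', 'body_mass_g']
print(y.tolist())       # -> [0, 2, 1]
```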
Splitting training and testing sets
@@ -183,17 +184,14 @@ This splitting is typically done using the ``train_test_split`` function from ``

.. code-block:: python

   from sklearn.model_selection import train_test_split

   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

   print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")

Feature scaling
^^^^^^^^^^^^^^^

Before training, it is also essential to ensure that numerical features are properly scaled by applying standardization or normalization -- especially for distance-based or gradient-based models -- to achieve good results.
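Put together, the splitting and scaling steps look like this (on synthetic stand-in data; the scaler is fitted on the training split only, so no information leaks from the test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the encoded feature matrix and labels
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(len(X_train), len(X_test))  # -> 80 20

# fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```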
@@ -208,12 +206,20 @@ Before training, it is also essential to ensure that numerical features are prop

   X_test_scaled = scaler.transform(X_test)


Training Model & Evaluating Model Performance
---------------------------------------------

After preparing the Penguins dataset by handling missing values, encoding categorical variables, and splitting it into features and labels and into training and test sets, the next step is to apply classification algorithms, including k-Nearest Neighbors (KNN), Decision Trees, Random Forests, Naive Bayes, and Neural Networks, to predict penguin species from their physical measurements.
Each algorithm offers a unique approach to pattern recognition and generalization, and applying them to the same prepared dataset allows for a fair comparison of their predictive performance.

Below are the generic steps for training a model with the representative algorithms we will use for penguin classification:

- choosing a model class and importing that model: ``from sklearn.neighbors import XXXClassifier``
- choosing the model hyperparameters by instantiating this class with desired values: ``xxx_clf = XXXClassifier(<... hyperparameters ...>)``
- training the model on the preprocessed training data by calling the ``fit()`` method of the model instance: ``xxx_clf.fit(X_train_scaled, y_train)``
- making predictions with the trained model on the test data: ``y_pred_xxx = xxx_clf.predict(X_test_scaled)``
- evaluating the model's performance using available metrics: ``score_xxx = accuracy_score(y_test, y_pred_xxx)``
- (optional) visualizing the confusion matrix and other relevant data
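The steps above can be sketched end-to-end with k-Nearest Neighbors on synthetic stand-in data (two well-separated clusters, so the accuracy printed here says nothing about real penguins):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# synthetic stand-in: two well-separated clusters of 4-feature points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# choose the model class and hyperparameters, fit, predict, and score
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
y_pred_knn = knn_clf.predict(X_test_scaled)
score_knn = accuracy_score(y_test, y_pred_knn)
print(score_knn)
```

Swapping ``KNeighborsClassifier`` for any other scikit-learn classifier leaves the rest of this recipe unchanged, which is what makes the comparison across algorithms fair.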