Commit aba2f5f

data splitting/Splitting features and labels in episode 4
1 parent 9cdf88f commit aba2f5f

1 file changed: +18 -0 lines changed


content/04-supervised-ML-classification.rst

Lines changed: 18 additions & 0 deletions
@@ -155,5 +155,23 @@ Then we apply the same rule to encode the island and sex columns. Although these
     penguins_classification.loc[:, 'sex'] = encoder.fit_transform(penguins_classification['sex'])
 
 
+Data Splitting
+--------------
+
+
+Splitting features and labels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To prepare the penguins dataset for classification, we first split the data into features and labels. The target variable we aim to predict is the penguin species, which we encode into numeric labels using ``LabelEncoder``. This encoded species column is the **label vector**, conventionally named **y**. The remaining columns (such as bill length, bill depth, flipper length, body mass, and the encoded categorical variables island and sex) form the **feature matrix**, conventionally named **X**. These features are the input information the model learns from.
+
+Separating features (**X**) from labels (**y**) keeps the model's inputs clearly distinct from the value it is trying to predict.
+
+
+.. code-block:: python
+
+    X = penguins_classification.drop(['species'], axis=1)  # feature matrix: every column except the target
+    y = penguins_classification['species']  # label vector: the encoded species column
+
+
 
 
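The split added in this commit can be tried in isolation on a small table. The sketch below is only an illustration: ``toy_penguins`` is a hypothetical stand-in for ``penguins_classification``, and its column names and values are assumptions, not data taken from the lesson. It assumes the categorical columns have already been encoded as integers.

```python
import pandas as pd

# Hypothetical miniature stand-in for penguins_classification.
# Column names and values are assumptions made for illustration only;
# species, island and sex are assumed to be integer-encoded already.
toy_penguins = pd.DataFrame({
    "species": [0, 1, 2, 0],
    "island": [2, 1, 0, 2],
    "bill_length_mm": [39.1, 46.5, 50.0, 38.6],
    "bill_depth_mm": [18.7, 17.9, 15.2, 17.2],
    "flipper_length_mm": [181.0, 192.0, 218.0, 190.0],
    "body_mass_g": [3750.0, 3500.0, 5700.0, 3800.0],
    "sex": [1, 0, 1, 0],
})

# Same pattern as in the diff: drop the target column to get the features,
# and select the target column on its own to get the labels.
X = toy_penguins.drop(["species"], axis=1)
y = toy_penguins["species"]

# Sanity checks: one label per row, and no target column left in X.
print(X.shape)                 # (4, 6)
print(y.shape)                 # (4,)
print("species" in X.columns)  # False
```

``drop`` returns a new DataFrame, so the original table keeps its ``species`` column. Checking that ``X`` and ``y`` have the same number of rows, and that the target is absent from ``X``, is a cheap way to catch label leakage before training.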
