Commit e64573f

data splitting/Splitting features and labels in episode 4
1 parent aba2f5f commit e64573f

1 file changed

+14
-1
lines changed

content/04-supervised-ML-classification.rst

Lines changed: 14 additions & 1 deletion
@@ -1,4 +1,4 @@
-Supervised ML (I): Classification
+Supervised Learning (I): Classification
 =================================
 
 
@@ -173,5 +173,18 @@ Separating features (X) from labels (y) ensures a clear distinction between what
    y = penguins_classification['species']
 
 
+Splitting training and testing sets
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After separating features and labels in the penguins dataset, we further divide the data into a training set and a testing set. The training set is used to fit the model, allowing it to learn patterns and relationships from the data. The testing set is held back and used only to evaluate the model’s performance on unseen data. A common split is 80% for training and 20% for testing, which leaves enough data for training while retaining a meaningful testing set.
+
+This split is typically done with the ``train_test_split`` function from ``sklearn.model_selection``, using a fixed ``random_state`` so that the same rows land in each set on every run.
+
+.. code-block:: python
+
+   from sklearn.model_selection import train_test_split
+   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
+   print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")
+
 
 
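To make the idea behind the added section concrete, here is a minimal standard-library sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation: the helper name ``simple_train_test_split``, the choice to round the test share up, and the toy data are all assumptions made for illustration.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.2, seed=123):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut them in two."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    # Assumption of this sketch: round the test share up to a whole row.
    n_test = math.ceil(len(X) * test_size)
    indices = list(range(len(X)))
    # A seeded generator produces the same shuffled order on every run.
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same positions so each row keeps its label.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy stand-in for the penguin features and labels (made-up values).
X = [[i, i * 2] for i in range(10)]
y = [i % 3 for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(f"Number of examples for training is {len(X_train)} and test is {len(X_test)}")
```

Calling the helper again with the same ``seed`` returns the same partition, which is the role ``random_state`` plays in the lesson's ``train_test_split`` call. Changing the seed changes which rows land in each set but not the set sizes.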