k-Nearest Neighbors: Predict
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.
In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.
The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.
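from sklearn.neighbors import KNeighborsClassifier
# Create arrays for the target (labels) and the features
y = df['party'].values
X = df.drop('party', axis=1).values
# Create a k-NN classifier with 6 neighbors and fit it to the data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)
# Predict the labels for the training data X
y_pred = knn.predict(X)
# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))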
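Since df and X_new exist only in the exercise environment, here is a minimal sketch that swaps in randomly generated stand-ins to show how many predictions each call returns. The 16 binary features, the random labels, and the seed are assumptions made for illustration; they are not the real voting data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in data (assumption): 435 samples, 16 binary features, random labels.
X = rng.integers(0, 2, size=(435, 16))
y = rng.choice(["democrat", "republican"], size=435)

# A single unlabeled point must still be 2-D: one row, 16 columns.
X_new = rng.integers(0, 2, size=(1, 16))

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

train_predictions = knn.predict(X)    # one prediction per training sample
new_prediction = knn.predict(X_new)   # a single prediction

print(train_predictions.shape)  # (435,)
print(new_prediction.shape)     # (1,)
```

Note that .predict() expects a 2-D array even for one sample, which is why X_new has shape (1, 16) rather than (16,).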
In classification, accuracy is a commonly used metric:
Accuracy = number of correct predictions / total number of data points
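As a small worked example of the formula above, accuracy can be computed by hand from two lists of labels. The labels here are invented for illustration and do not come from the voting data.

```python
# Hypothetical labels, invented for illustration only.
y_true = ["democrat", "republican", "democrat", "democrat", "republican"]
y_hat = ["democrat", "democrat", "democrat", "democrat", "republican"]

# Count the positions where the prediction matches the known label.
n_correct = sum(t == p for t, p in zip(y_true, y_hat))

# Accuracy = number of correct predictions / total number of data points.
accuracy = n_correct / len(y_true)
print(accuracy)  # 4 correct out of 5 -> 0.8
```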
Which data do we use to compute accuracy? What we are really interested in is how well our model performs on new data, that is, samples that the algorithm has never seen before. Computing accuracy on the training set would not be a good indicator of this, because the model has already seen that data.
So the common practice is to split our data into two sets, a training set and a test set, and then:
- fit/train the classifier on the training set
- make predictions on the test set
- compare the predictions with the known labels
Let's walk through the code step by step:
- We first import train_test_split from sklearn.model_selection.
- We then use the train_test_split function to randomly split our data.
- The first argument, X, is our feature data; the second, y, is the target or labels.
- The test_size keyword argument specifies what proportion of the original data is used for the test set.
- The random_state keyword argument sets a seed for the random number generator that splits the data into train and test sets. Setting the seed with the same argument later will allow you to reproduce the exact split and your downstream results.
- train_test_split returns 4 arrays:
  - training data
  - test data
  - training labels
  - test labels
- Next we unpack these into 4 variables: X_train, X_test, y_train, y_test.
- By default, train_test_split splits the data into 75% training and 25% test data, which is a good rule of thumb. Here we set the test set size to 30% using test_size.
- It is also best practice to perform the split so that it reflects the labels in your data. That is, you want the labels to be distributed in the train and test sets as they are in the original dataset. To achieve this, we use the keyword argument stratify=y, where y is the list or array containing the labels.
- Next we instantiate our k-nearest neighbors classifier,
- fit it to the training data,
- and make our predictions on the test data, storing the result in y_pred. Printing them shows 3 values, as expected.
- To check the accuracy of our model, we use the score method of the model and pass it X_test and y_test.
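The steps described above can be sketched roughly as follows. The notes do not include the code or name the dataset, so this sketch assumes the iris dataset bundled with scikit-learn as a stand-in. The random_state value of 42 is arbitrary, and n_neighbors=6 is borrowed from the earlier example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the iris dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing. stratify=y keeps the label
# proportions the same in both splits; random_state (42 is arbitrary)
# makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Instantiate the classifier and fit it on the training set only.
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Predict labels for the held-out test set.
y_pred = knn.predict(X_test)
print("Test set predictions:", y_pred)

# score() returns the mean accuracy on the given data and labels.
test_accuracy = knn.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Calling train_test_split again with the same random_state should give the same split, which is what makes downstream results reproducible.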
**Note:** As K increases, the decision boundary gets smoother and less curvy.
Larger K = smoother decision boundary = less complex model
Smaller K = more complex model = can lead to overfitting
There is a sweet spot at an intermediate value of K that can give us the best performance on the test set.
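One way to look for that sweet spot is to fit a classifier for several values of K and record the accuracy on both the training and test sets. The sketch below is an illustration under assumptions: it uses the iris dataset as a stand-in, an arbitrary seed of 42, and an arbitrary range of K from 1 to 25.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the iris dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_acc = {}
test_acc = {}

# Try K = 1..25 (an arbitrary range chosen for illustration).
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_acc[k] = knn.score(X_train, y_train)  # accuracy on seen data
    test_acc[k] = knn.score(X_test, y_test)     # accuracy on unseen data

# Pick the smallest K that reaches the highest test accuracy.
best_k = max(test_acc, key=test_acc.get)
print("Best K on the test set:", best_k, "with accuracy", test_acc[best_k])
```

Plotting train_acc and test_acc against K would give a model complexity curve: very small K tends to score well on the training set but can overfit, while very large K can underfit.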