This project classifies penguin species with two popular machine learning algorithms, K-Nearest Neighbours (KNN) and Naive Bayes, and compares them to determine which performs better across several evaluation metrics. The notebook covers data loading, cleaning, and visualisation, followed by model training (with and without cross-validation), hyperparameter tuning, and evaluation. Model performance is assessed using confusion matrices, precision, recall, F1-score, ROC curves, and Area Under the Curve (AUC). The notebook also investigates how Principal Component Analysis (PCA) affects the performance of both models.
This notebook covers the following steps:
- Installation: Installing the necessary R packages.
- Data Setup: Loading and preparing the `palmerpenguins` dataset.
- Data Cleaning: Handling missing values and selecting relevant features.
- Data Visualisation: Exploring the distribution of key features.
- KNN without Cross-Validation: Training and evaluating a basic KNN model.
- Naive Bayes without Cross-Validation: Training and evaluating a basic Naive Bayes model.
- KNN with Cross-Validation: Implementing KNN with 10-fold cross-validation for improved hyperparameter tuning.
- Naive Bayes with Cross-Validation: Implementing Naive Bayes with 10-fold cross-validation.
- Principal Component Analysis (PCA): Investigating the effect of dimensionality reduction on model performance using KNN and Naive Bayes.
- Evaluation: Comparing the performance of all models using confusion matrices, precision, recall, F1-score, ROC curves, and AUC.
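The first few steps above (data setup, cleaning, and the two models without cross-validation) might look roughly like the sketch below. It is not the notebook's actual code: the four numeric features, the 80/20 split, the seed, and `k = 5` are all assumptions chosen for illustration.

```r
library(palmerpenguins)  # provides the `penguins` data frame
library(class)           # knn()
library(naivebayes)      # naive_bayes()

# Keep the species label and four numeric measurements, then drop rows
# with missing values. The feature selection here is an assumption.
features <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
df <- as.data.frame(penguins[, c("species", features)])
df <- df[complete.cases(df), ]

# Simple hold-out split (80/20 ratio and seed are arbitrary choices).
set.seed(42)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# KNN is distance-based, so standardise the features using statistics
# computed on the training set only.
train_x <- scale(train[, features])
test_x  <- scale(test[, features],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Basic KNN with a fixed k (no tuning).
knn_pred <- knn(train = train_x, test = test_x, cl = train$species, k = 5)

# Basic Naive Bayes; numeric predictors are modelled as Gaussian.
nb_fit  <- naive_bayes(species ~ ., data = train)
nb_pred <- predict(nb_fit, newdata = test[, features])

# Confusion matrices and hold-out accuracy for both models.
knn_cm <- table(Predicted = knn_pred, Actual = test$species)
nb_cm  <- table(Predicted = nb_pred,  Actual = test$species)
print(knn_cm)
print(nb_cm)

knn_acc <- mean(knn_pred == test$species)
nb_acc  <- mean(nb_pred == test$species)
```

Scaling with training-set statistics keeps information from the test set out of the model, which matters for KNN because body mass (in grams) would otherwise dominate the distance calculation.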
This project uses the palmerpenguins dataset.
This project uses the R programming language and several packages, including tidyverse, palmerpenguins, caret, class, scales, ggplot2, pROC, naivebayes, and e1071.
The following machine learning models are evaluated:
- K-Nearest Neighbours (KNN)
- Naive Bayes
To run this notebook, you need R installed. The required packages can be installed directly from within the R environment using the code in the notebook.
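As a rough idea of what such an installation cell could contain, the sketch below installs only the packages that are not already available. The package list is taken from this README; the CRAN mirror URL and the `to_install` helper variable are assumptions, and the notebook's own installation code may differ.

```r
# Packages named in this README.
pkgs <- c("tidyverse", "palmerpenguins", "caret", "class", "scales",
          "ggplot2", "pROC", "naivebayes", "e1071")

# Check which packages can be loaded, and install only the missing ones.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

if (length(to_install) > 0) {
  # The mirror is an assumption; any CRAN mirror should work.
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```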
The notebook presents the evaluation metrics for each model, both with and without cross-validation and PCA. The confusion matrices, precision, recall, F1-scores, ROC curves, and AUC values provide insights into the performance of KNN and Naive Bayes for penguin species classification on this dataset.
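To illustrate how metrics of this kind can be produced, here is a minimal sketch that tunes KNN with 10-fold cross-validation in `caret`, with and without PCA, and reads per-class precision, recall, and F1 from the confusion matrix. The feature set, the grid of `k` values, the split, and the seed are assumptions, not the notebook's actual settings, and no results are implied.

```r
library(palmerpenguins)
library(caret)

# Assumed feature set: the four numeric measurements, complete rows only.
features <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
df <- as.data.frame(penguins[, c("species", features)])
df <- df[complete.cases(df), ]

# Stratified 80/20 split (ratio and seed are arbitrary choices).
set.seed(42)
idx   <- createDataPartition(df$species, p = 0.8, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

# 10-fold cross-validation on the training set.
ctrl <- trainControl(method = "cv", number = 10)
grid <- data.frame(k = seq(1, 15, by = 2))  # candidate k values (assumed)

# KNN with centring and scaling applied inside each fold.
knn_cv <- train(species ~ ., data = train, method = "knn",
                preProcess = c("center", "scale"),
                tuneGrid = grid, trControl = ctrl)

# Same model with PCA added as a preprocessing step.
knn_pca <- train(species ~ ., data = train, method = "knn",
                 preProcess = c("center", "scale", "pca"),
                 tuneGrid = grid, trControl = ctrl)

# Evaluate the tuned model on the held-out test set.
pred <- predict(knn_cv, newdata = test)
cm   <- confusionMatrix(pred, test$species)

# Per-class precision, recall, and F1 (one row per species).
metrics <- cm$byClass[, c("Precision", "Recall", "F1")]
print(knn_cv$bestTune)
print(cm$table)
print(metrics)

# Compare cross-validated accuracy with and without PCA.
print(max(knn_cv$results$Accuracy))
print(max(knn_pca$results$Accuracy))
```

Putting the preprocessing inside `train()` means scaling and PCA are re-estimated on each training fold, so the cross-validated accuracy is not inflated by information from the held-out fold.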
This project utilises the palmerpenguins dataset, generously provided by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.
To use this notebook:
- Open the notebook in a compatible environment (like Google Colab with an R kernel).
- Run the cells sequentially to follow the data analysis and model training process.
- Examine the outputs and visualisations to understand the data and model performance.