From 12b20756fbed4b7b40263e2e8b61175c178f9351 Mon Sep 17 00:00:00 2001 From: Chris Mauck Date: Fri, 6 Jan 2023 10:59:25 -0600 Subject: [PATCH 1/3] Handling Mislabeled Tabular Data --- .DS_Store | Bin 0 -> 6148 bytes ...r_Data_to_Improve_Your_XGBoost_Model.ipynb | 657 ++++++++++++++++++ 2 files changed, 657 insertions(+) create mode 100644 .DS_Store create mode 100644 handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..bc6b0f6102af0c717b43e4e208fceffe4f335b90 GIT binary patch literal 6148 zcmeHK!Aj&n5Un0Z%^<=Y6!+NRRoU^NC|<&-Kd@Q!pv#z$puw1yNya%01G(xC`6qtQ zzUuA~W!96h3{nNvud2GLJ9!-BOHoOX>akyK=F2tS37STe32lENey>m_+8 z@~ZCjUR7(Uy}Ys-tOjerpXk<%qdKlfRX-kHviIAVBAMB7a-L`7L3iWCly#h!+0Yi` zX@-=mi@Z$DxNk;fTG`sjbp)NDGw5zjCWl7{JGwXB(azfI9qmu2onUkO=kc$>U2$KU zC&#_QXKH2L;|!i*e3^#FXjm4eyunzrd1m7a3IoD`FtAVz_)RKUTd1d!hY<#Zfkj|| z_XiDS3_Ugu?bd;Ye;>KO&By|q?-GpKW9YGQh#rV?r9fAz{1rpFa>TXI3q3XtT{$T` zGsf{VD}O^#c6P+IO(zvPlvWrJ20k*d\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
stud_IDexam_1exam_2exam_3notesletter_gradenoisy_letter_grade
0f48f73537793522
10bd4e7816480211
2e1795d748897511
3cb9d7a619478522
49acca4489091522
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + " \n", + " " + ] + }, + "metadata": {}, + "execution_count": 4 + } + ], + "source": [ + "!pip install cleanlab==2.2\n", + "!pip install xgboost==1.7\n", + "\n", + "from cleanlab.filter import find_label_issues\n", + "from xgboost import XGBClassifier\n", + "from sklearn import preprocessing\n", + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.metrics import accuracy_score\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n", + "df_c = df.copy()\n", + "\n", + "# Transform letter grades and notes to categorical numbers.\n", + "# Necessary for XGBoost and cleanlab.\n", + "df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n", + "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n", + "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n", + "df['notes'] = df['notes'].astype('category')\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Get What We Need\n", + "\n", + "We need to obtain **out-of-sample** predicted probabilities for all of our data in order to provide the `find_label_issues()` method with the necessary input. To do this, we will use XGBoost which is commonly used with tabular data. Specifically, getting the predicted probabilities can be achieved through the use of a `XGBClassifier` model in conjunction with cross-validation, which can be implemented easily using the `cross_val_predict` function from scikit-learn.\n", + "\n", + "If our tabular data consisted solely of numerical and boolean values, we could potentially utilize a simpler model such as a nearest-neighbor or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. 
Fortunately, XGBoost (>v1.6) is able to handle mixed data types (numerical and categorical) by setting the `enable_categorical` parameter to `true`, thereby simplifying the modeling process.\"" + ], + "metadata": { + "id": "dVH_iciASD9F" + } + }, + { + "cell_type": "code", + "source": [ + "# Train model on noisy labels.\n", + "# Convert numerical notes label encoding to categorical.\n", + "data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", + "labels = df['noisy_letter_grade']\n", + "\n", + "# XGBoost(experimental) supports categorical data.\n", + "# Here we use default hyperparameters for simplicity.\n", + "# Get out-of-sample predicted probabilities and check model accuracy.\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n", + "preds = np.argmax(pred_probs, axis=1)\n", + "\n", + "acc_original = accuracy_score(preds, labels)\n", + "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")" + ], + "metadata": { + "id": "gCS19IqJsQUL", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Accuracy with original data: 67.4%\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Using the default hyperparameters, our cross-validated XGBoost model demonstrates an accuracy of 67.3% when predicting the noisy labels. This level of performance on such a basic task is unsatisfactory. It appears that the presence of 20% label noise is significantly disrupting the model's ability to accurately predict the labels." + ], + "metadata": { + "id": "lTQ_iB-JSWUl" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Find Label Issues\n", + "\n", + "In just one line of code we get a list of possible label issues - it really is that easy! 
Top 5 results shown below.\n", + "\n", + "Let's take a look at a few of the errors cleanlab has found. Take a look at row 2, where the student cheated on exam 1 and got grades of 0, 96, and 90 which should result in a 'D' yet was accidentally labeled as a 'B'. In row 5, the student missed homework resulting in a deduction of 10 points from the overall average, receiving exam grades of 97, 86, and 68 (averages to 83, overall 73 with the deduction) which should result in a 'C' yet was accidentally labeled as an 'A'. " + ], + "metadata": { + "id": "klDe2ag8SZ2T" + } + }, + { + "cell_type": "code", + "source": [ + "# Returns list of indices of label issues, sorted by self_confidence.\n", + "issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')\n", + "# Filter original data to show some issues.\n", + "issues_df = df_c.iloc[issue_idx]\n", + "# Show a few good examples.\n", + "issues_df.iloc[13:18]" + ], + "metadata": { + "id": "SfJ83uP-Xski", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " stud_ID exam_1 exam_2 exam_3 notes \\\n", + "23 5eef2c 90 83 51 NaN \n", + "159 b3a1a5 0 96 90 cheated on exam, gets 0pts \n", + "301 4591b4 66 72 83 missed homework frequently -10 \n", + "71 38a6ec 88 67 74 NaN \n", + "885 f00c02 97 86 68 missed homework frequently -10 \n", + "\n", + " letter_grade noisy_letter_grade \n", + "23 C A \n", + "159 D B \n", + "301 D B \n", + "71 C A \n", + "885 C A " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
stud_IDexam_1exam_2exam_3notesletter_gradenoisy_letter_grade
235eef2c908351NaNCA
159b3a1a509690cheated on exam, gets 0ptsDB
3014591b4667283missed homework frequently -10DB
7138a6ec886774NaNCA
885f00c02978668missed homework frequently -10CA
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ] + }, + "metadata": {}, + "execution_count": 6 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# How'd We Do?\n", + "\n", + "Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the labels errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 80% of the label errors correctly (based on predictions from a model that is only 67% accurate). " + ], + "metadata": { + "id": "PrvJHkPzSq6Q" + } + }, + { + "cell_type": "code", + "source": [ + "# Computing percentage of true errors identified. \n", + "true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values\n", + "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n", + "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")" + ], + "metadata": { + "id": "9O2a6urWc1DA", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Percentage of errors found: 79.8%\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Train a More Robust Model\n", + "\n", + "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n", + "\n", + "Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which achieved a cross-validation accuracy of 67%.\n", + "\n", + "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`." 
+ ], + "metadata": { + "id": "YzxXoDOqSzn-" + } + }, + { + "cell_type": "code", + "source": [ + "# Remove the label errors found by cleanlab.\n", + "data = df.drop(issue_idx)\n", + "labels = data['noisy_letter_grade']\n", + "data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", + "\n", + "# Train a more robust classifier with less erroneous data.\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n", + "preds = np.argmax(pred_probs, axis=1)\n", + "\n", + "acc_clean = accuracy_score(preds, labels)\n", + "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n", + "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n", + "\n", + "# Compute reduction in error.\n", + "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n", + "print(f\"Reduction in error: {round(err*100,1)}%\")" + ], + "metadata": { + "id": "FsQFmy7xgSUa", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Accuracy with original data: 67.4%\n", + "Accuracy with errors found by cleanlab removed: 90.1%\n", + "Reduction in error: 69.7%\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "After removing the suspected label issues, our model's new cross-validation accuracy is now 90%, which means we **reduced the error-rate of the model by 70%** (the original model had 67% accuracy). \n", + "\n", + "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! 
This improvement is strictly coming from increasing the quality of our data which leaves additional room for additional optimizations on the modeling side.**" + ], + "metadata": { + "id": "9J9clVf1UzQZ" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Conclusion\n", + "\n", + "For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error on our classification problem (with accuracy improving from 67% to 90%). By using cleanlab to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n", + "\n", + "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)" + ], + "metadata": { + "id": "-W-Lo82SVp7I" + } + } + ] +} \ No newline at end of file From f19e1dbe7111e51ce4f8b2f78dd8b46a1ac845b5 Mon Sep 17 00:00:00 2001 From: Chris Mauck <38672284+cmauck10@users.noreply.github.com> Date: Wed, 11 Jan 2023 14:21:26 -0600 Subject: [PATCH 2/3] Add colab link --- ..._Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb | 2 ++ 1 file changed, 2 insertions(+) diff --git a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb index 3b74fe3..5944945 100644 --- a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb +++ b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb @@ -19,6 +19,8 @@ "source": [ "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n", "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n", + "\n", "This notebook 
highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n", "\n", "At a high level we will:\n", From 83d1ac79e05a266d7c46c1b82c1030faee12499e Mon Sep 17 00:00:00 2001 From: Chris Mauck <38672284+cmauck10@users.noreply.github.com> Date: Tue, 7 Feb 2023 22:04:06 -0600 Subject: [PATCH 3/3] Update methodology and copy. --- ...r_Data_to_Improve_Your_XGBoost_Model.ipynb | 1141 ++++++++--------- 1 file changed, 500 insertions(+), 641 deletions(-) diff --git a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb index 5944945..32529fe 100644 --- a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb +++ b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb @@ -1,659 +1,518 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "zwfWqPeA1zX0" + }, + "source": [ + "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n", + "\n", + "[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n", + "\n", + "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n", + "\n", + "At a high level we will:\n", + "- Establish the baseline accuracy of an XGBoost model on the original data.\n", + "- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points. \n", + "- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. This simple step **reduces the error in model predictions by 36%!** The raw difference in accuracy values between the two XGBoost models is **8%**.\n", + "- Manually correct the label issues of all examples found by `find_label_issues()`, which **reduces the error in model predictions by 70%** from the baseline, using the identical XGBoost model!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "je7P55z4RwX_" + }, + "source": [ + "## Setup and Data Processing\n", + "\n", + "Let's take a look at our student grades tabular dataset. 
The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.\n", + "\n", + "However, for this demonstration, we have access to the true letter grade each student should've received, which we use for evaluating both the underlying accuracy of model predictions and how well cleanlab detects which data are mislabeled. These true grades are reserved for evaluation only; they are not present in the dataset used for ML.\n", + "\n", + "In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { "colab": { - "provenance": [] + "base_uri": "https://localhost:8080/", + "height": 206 }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } + "id": "nQVmMBQOS43j", + "outputId": "54a45659-ecb6-47fe-acfa-d0424027af47" + }, + "outputs": [], + "source": [ + "# !pip install cleanlab==2.2\n", + "# !pip install xgboost==1.7\n", + "\n", + "from cleanlab.filter import find_label_issues\n", + "from xgboost import XGBClassifier\n", + "from sklearn import preprocessing\n", + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import accuracy_score\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n", + "df_c = df.copy()\n", + "\n", + "# Transform letter grades and notes to categorical numbers.\n", + "# Necessary for XGBoost and cleanlab.\n", + "df['letter_grade'] = 
preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n", + "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n", + "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n", + "df['notes'] = df['notes'].astype('category')\n", + "\n", + "# Split data for evaluation and set test data.\n", + "df_train, df_test = train_test_split(df, random_state=0)\n", + "df_train.reset_index(drop=True, inplace=True)\n", + "df_test.reset_index(drop=True, inplace=True)\n", + "test_data = df_test.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", + "test_labels = df_test['letter_grade']" + ] }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n", - "\n", - "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n", - "\n", - "At a high level we will:\n", - "- Establish a baseline XGBoost model accuracy on the original data.\n", - "- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points. 
\n", - "- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. This simple step reduces the error in model predictions by **70%!** The raw difference in accuracy values between the two XGBoost models is a whopping **23%**." - ], - "metadata": { - "id": "zwfWqPeA1zX0" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Setup and Data Processing\n", - "\n", - "Let's take a look at our student grades tabular dataset. The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.\n", - "\n", - "However, for this demonstration, we have access to the true letter grade each student should've received, which we use for evaluating both the underlying accuracy of model predictions and how well cleanlab detects which data are mislabeled. These true grades are only reserved for evaluation, they are not present in the dataset used for ML.\n", - "\n", - "In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation." - ], - "metadata": { - "id": "je7P55z4RwX_" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - }, - "id": "nQVmMBQOS43j", - "outputId": "54a45659-ecb6-47fe-acfa-d0424027af47" - }, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " stud_ID exam_1 exam_2 exam_3 notes letter_grade noisy_letter_grade\n", - "0 f48f73 53 77 93 5 2 2\n", - "1 0bd4e7 81 64 80 2 1 1\n", - "2 e1795d 74 88 97 5 1 1\n", - "3 cb9d7a 61 94 78 5 2 2\n", - "4 9acca4 48 90 91 5 2 2" - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
stud_IDexam_1exam_2exam_3notesletter_gradenoisy_letter_grade
0f48f73537793522
10bd4e7816480211
2e1795d748897511
3cb9d7a619478522
49acca4489091522
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ] - }, - "metadata": {}, - "execution_count": 4 - } - ], - "source": [ - "!pip install cleanlab==2.2\n", - "!pip install xgboost==1.7\n", - "\n", - "from cleanlab.filter import find_label_issues\n", - "from xgboost import XGBClassifier\n", - "from sklearn import preprocessing\n", - "from sklearn.model_selection import cross_val_predict\n", - "from sklearn.metrics import accuracy_score\n", - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n", - "df_c = df.copy()\n", - "\n", - "# Transform letter grades and notes to categorical numbers.\n", - "# Necessary for XGBoost and cleanlab.\n", - "df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n", - "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n", - "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n", - "df['notes'] = df['notes'].astype('category')\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "source": [ - "# Get What We Need\n", - "\n", - "We need to obtain **out-of-sample** predicted probabilities for all of our data in order to provide the `find_label_issues()` method with the necessary input. To do this, we will use XGBoost which is commonly used with tabular data. Specifically, getting the predicted probabilities can be achieved through the use of a `XGBClassifier` model in conjunction with cross-validation, which can be implemented easily using the `cross_val_predict` function from scikit-learn.\n", - "\n", - "If our tabular data consisted solely of numerical and boolean values, we could potentially utilize a simpler model such as a nearest-neighbor or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. 
Fortunately, XGBoost (>v1.6) is able to handle mixed data types (numerical and categorical) by setting the `enable_categorical` parameter to `true`, thereby simplifying the modeling process.\"" - ], - "metadata": { - "id": "dVH_iciASD9F" - } - }, - { - "cell_type": "code", - "source": [ - "# Train model on noisy labels.\n", - "# Convert numerical notes label encoding to categorical.\n", - "data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", - "labels = df['noisy_letter_grade']\n", - "\n", - "# XGBoost(experimental) supports categorical data.\n", - "# Here we use default hyperparameters for simplicity.\n", - "# Get out-of-sample predicted probabilities and check model accuracy.\n", - "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", - "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n", - "preds = np.argmax(pred_probs, axis=1)\n", - "\n", - "acc_original = accuracy_score(preds, labels)\n", - "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")" - ], - "metadata": { - "id": "gCS19IqJsQUL", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Accuracy with original data: 67.4%\n" - ] - } - ] - }, - { - "cell_type": "markdown", - "source": [ - "Using the default hyperparameters, our cross-validated XGBoost model demonstrates an accuracy of 67.3% when predicting the noisy labels. This level of performance on such a basic task is unsatisfactory. It appears that the presence of 20% label noise is significantly disrupting the model's ability to accurately predict the labels." - ], - "metadata": { - "id": "lTQ_iB-JSWUl" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Find Label Issues\n", - "\n", - "In just one line of code we get a list of possible label issues - it really is that easy! 
Top 5 results shown below.\n", - "\n", - "Let's take a look at a few of the errors cleanlab has found. Take a look at row 2, where the student cheated on exam 1 and got grades of 0, 96, and 90 which should result in a 'D' yet was accidentally labeled as a 'B'. In row 5, the student missed homework resulting in a deduction of 10 points from the overall average, receiving exam grades of 97, 86, and 68 (averages to 83, overall 73 with the deduction) which should result in a 'C' yet was accidentally labeled as an 'A'. " - ], - "metadata": { - "id": "klDe2ag8SZ2T" - } + { + "cell_type": "markdown", + "metadata": { + "id": "dVH_iciASD9F" + }, + "source": [ + "# Train and Evaluate XGBoost Classifier\n", + "\n", + "Now that we’ve seen what can be achieved with cleanlab, let’s take a look at how we get there.\n", + "\n", + "For our model of choice, we will use XGBoost, an implementation of gradient-boosted decision trees (GBDT), which are commonly used with tabular data. If our tabular data consisted solely of numerical and boolean values, we could potentially utilize a simpler model such as a nearest-neighbor or logistic regression model. However, our data includes a notes column, which we will treat as a categorical feature. Fortunately, XGBoost (>v1.6) is able to handle mixed data types (numerical and categorical) by setting the `enable_categorical` parameter to `True`, thereby simplifying the modeling process." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "gCS19IqJsQUL", + "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b" + }, + "outputs": [ { - "cell_type": "code", - "source": [ - "# Returns list of indices of label issues, sorted by self_confidence.\n", - "issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')\n", - "# Filter original data to show some issues.\n", - "issues_df = df_c.iloc[issue_idx]\n", - "# Show a few good examples.\n", - "issues_df.iloc[13:18]" - ], - "metadata": { - "id": "SfJ83uP-Xski", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 206 - }, - "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - " stud_ID exam_1 exam_2 exam_3 notes \\\n", - "23 5eef2c 90 83 51 NaN \n", - "159 b3a1a5 0 96 90 cheated on exam, gets 0pts \n", - "301 4591b4 66 72 83 missed homework frequently -10 \n", - "71 38a6ec 88 67 74 NaN \n", - "885 f00c02 97 86 68 missed homework frequently -10 \n", - "\n", - " letter_grade noisy_letter_grade \n", - "23 C A \n", - "159 D B \n", - "301 D B \n", - "71 C A \n", - "885 C A " - ], - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
stud_IDexam_1exam_2exam_3notesletter_gradenoisy_letter_grade
235eef2c908351NaNCA
159b3a1a509690cheated on exam, gets 0ptsDB
3014591b4667283missed homework frequently -10DB
7138a6ec886774NaNCA
885f00c02978668missed homework frequently -10CA
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ] - }, - "metadata": {}, - "execution_count": 6 - } - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy with original data: 79.2%\n" + ] + } + ], + "source": [ + "# Train model on noisy labels.\n", + "train_data = df_train.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", + "train_labels = df_train['noisy_letter_grade']\n", + "\n", + "# XGBoost(experimental) supports categorical data.\n", + "# Here we use default hyperparameters for simplicity.\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "model.fit(train_data, train_labels)\n", + "\n", + "# Evaluate model on test split with ground truth labels.\n", + "preds = model.predict(test_data)\n", + "acc_original = accuracy_score(preds, test_labels)\n", + "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lTQ_iB-JSWUl" + }, + "source": [ + "Using the default hyperparameters, our baseline XGBoost model achieves an accuracy of 79.2% on the test set when trained on the noisy labels. It appears that the presence of 20% label noise is significantly disrupting the model’s ability to accurately predict the labels on such a trivial task." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "klDe2ag8SZ2T" + }, + "source": [ + "# Find Label Issues\n", + "\n", + "To use cleanlab, we need **out-of-sample** predicted probabilities for all of our training data, which are the input that the `find_label_issues()` method requires. We can obtain these predicted probabilities from our `XGBClassifier` model with cross-validation, which can be implemented easily using the `cross_val_predict` function from scikit-learn.\n", + "\n", + "In just a few lines of code, we get a list of possible label issues! A few of the top results are shown below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 }, + "id": "SfJ83uP-Xski", + "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e" + }, + "outputs": [ { - "cell_type": "markdown", - "source": [ - "# How'd We Do?\n", - "\n", - "Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the labels errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 80% of the label errors correctly (based on predictions from a model that is only 67% accurate). " - ], - "metadata": { - "id": "PrvJHkPzSq6Q" - } + "name": "stderr", + "output_type": "stream", + "text": [ + "2023-02-07 22:01:11.846308: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", + "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "/var/folders/0l/js52t50n2d71fm_0mhtlwx880000gn/T/ipykernel_37207/2155506160.py:11: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " issues_df.sort_values(by=\"stud_ID\", key=lambda column: column.map(lambda e: issue_stud_id.index(e)), inplace=True)\n" + ] }, { - "cell_type": "code", - "source": [ - "# Computing percentage of true errors identified. \n", - "true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values\n", - "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n", - "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")" + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
stud_IDexam_1exam_2exam_3notesletter_gradenoisy_letter_grade
76575ce98918981NaNBF
7443d0fdf907495great participation +10AF
63777c9c507965cheated on exam, gets 0ptsFA
404bb13f4659568NaNCA
2173c4cbb946266NaNCC
\n", + "
" ], - "metadata": { - "id": "9O2a6urWc1DA", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Percentage of errors found: 79.8%\n" - ] - } + "text/plain": [ + " stud_ID exam_1 exam_2 exam_3 notes letter_grade \\\n", + "765 75ce98 91 89 81 NaN B \n", + "744 3d0fdf 90 74 95 great participation +10 A \n", + "637 77c9c5 0 79 65 cheated on exam, gets 0pts F \n", + "404 bb13f4 65 95 68 NaN C \n", + "217 3c4cbb 94 62 66 NaN C \n", + "\n", + " noisy_letter_grade \n", + "765 F \n", + "744 F \n", + "637 A \n", + "404 A \n", + "217 C " ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get predicted probabilities through cross validation.\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "pred_probs = cross_val_predict(model, train_data, train_labels, method='predict_proba')\n", + "\n", + "# Returns list of indices of label issues, sorted by self_confidence.\n", + "issue_idx = find_label_issues(train_labels, pred_probs, return_indices_ranked_by='self_confidence')\n", + "\n", + "# Filter original data to show students with grade issues.\n", + "issue_stud_id = df_train.iloc[issue_idx].stud_ID.values.tolist()\n", + "issues_df = df_c[df_c['stud_ID'].isin(issue_stud_id)]\n", + "issues_df.sort_values(by=\"stud_ID\", key=lambda column: column.map(lambda e: issue_stud_id.index(e)), inplace=True)\n", + "\n", + "# Show a few good examples.\n", + "issues_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let’s take a look at a few of the label issues automatically identified in our dataset. Take a look at row 1, where the student got grades of 91, 89, and 81, which should result in a ‘B’ yet was accidentally labeled as an ‘F’. 
In row 2, the student had great participation resulting in an addition of 10 points to the overall average, receiving exam grades of 90, 74, and 95 (averages to 86.3, overall 96.3 with the bonus), which should result in an ‘A’ yet was accidentally labeled as an ‘F’.\n", + "\n", + "**Note: `find_label_issues` is able to determine that the given label is incorrect, without ever seeing the ground truth label `letter_grade`.**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PrvJHkPzSq6Q" + }, + "source": [ + "# How'd We Do?\n", + "\n", + "Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 83% of the label errors correctly (based on predictions from a model that is only 79% accurate). " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "9O2a6urWc1DA", + "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8" + }, + "outputs": [ { - "cell_type": "markdown", - "source": [ - "# Train a More Robust Model\n", - "\n", - "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n", - "\n", - "Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which achieved a cross-validation accuracy of 67%.\n", - "\n", - "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`." 
- ], - "metadata": { - "id": "YzxXoDOqSzn-" - } - }, - { - "cell_type": "code", - "source": [ - "# Remove the label errors found by cleanlab.\n", - "data = df.drop(issue_idx)\n", - "labels = data['noisy_letter_grade']\n", - "data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", - "\n", - "# Train a more robust classifier with less erroneous data.\n", - "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", - "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n", - "preds = np.argmax(pred_probs, axis=1)\n", - "\n", - "acc_clean = accuracy_score(preds, labels)\n", - "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n", - "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n", - "\n", - "# Compute reduction in error.\n", - "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n", - "print(f\"Reduction in error: {round(err*100,1)}%\")" - ], - "metadata": { - "id": "FsQFmy7xgSUa", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313" - }, - "execution_count": null, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Accuracy with original data: 67.4%\n", - "Accuracy with errors found by cleanlab removed: 90.1%\n", - "Reduction in error: 69.7%\n" - ] - } - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "Percentage of errors found: 82.9%\n" + ] + } + ], + "source": [ + "# Computing percentage of true errors identified. 
\n", + "true_error_idx = df_train[df_train.letter_grade != df_train.noisy_letter_grade].index.values\n", + "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n", + "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YzxXoDOqSzn-" + }, + "source": [ + "# Retraining a More Robust Model\n", + "\n", + "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n", + "\n", + "Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which achieved a test accuracy of 79%.\n", + "\n", + "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "FsQFmy7xgSUa", + "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313" + }, + "outputs": [ { - "cell_type": "markdown", - "source": [ - "After removing the suspected label issues, our model's new cross-validation accuracy is now 90%, which means we **reduced the error-rate of the model by 70%** (the original model had 67% accuracy). \n", - "\n", - "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! 
This improvement is strictly coming from increasing the quality of our data which leaves additional room for additional optimizations on the modeling side.**" - ], - "metadata": { - "id": "9J9clVf1UzQZ" - } - }, + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy with original data: 79.2%\n", + "Accuracy with errors found by cleanlab removed: 86.9%\n", + "Reduction in error: 36.7%\n" + ] + } + ], + "source": [ + "# Remove the label errors found by cleanlab.\n", + "train_data_cl = df_train.drop(issue_idx)\n", + "train_labels_cl = train_data_cl['noisy_letter_grade']\n", + "train_data_cl = train_data_cl.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n", + "\n", + "# Train a more robust classifier with less erroneous data.\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "model.fit(train_data_cl, train_labels_cl)\n", + "\n", + "# Evaluate model on test split with ground truth labels.\n", + "preds = model.predict(test_data)\n", + "acc_clean = accuracy_score(preds, test_labels)\n", + "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n", + "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n", + "\n", + "# Compute reduction in error.\n", + "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n", + "print(f\"Reduction in error: {round(err*100,1)}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9J9clVf1UzQZ" + }, + "source": [ + "After removing the suspected label issues, our model's test accuracy is now 87%, which means we **reduced the error rate of the model by about 37%** (the original model had 79% accuracy). \n", + "\n", + "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! 
This improvement comes strictly from increasing the quality of our data, which leaves room for further optimizations on the modeling side.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fixing Label Errors\n", + "\n", + "Instead of just dropping the potential label issues, the smarter (yet more time-intensive) way to increase our data quality would be to correct the automatically-identified label issues by hand. This simultaneously removes a noisy data point and adds an accurate one.\n", + "\n", + "I reviewed the potential label errors identified by `find_label_issues()` and made adjustments to the labels as needed." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "source": [ - "# Conclusion\n", - "\n", - "For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error on our classification problem (with accuracy improving from 67% to 90%). 
By using cleanlab to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n", - "\n", - "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)" - ], - "metadata": { - "id": "-W-Lo82SVp7I" - } + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy with original data: 79.2%\n", + "Accuracy with errors found by cleanlab removed: 86.9%\n", + "Accuracy with errors manually fixed: 93.6%\n", + "\n", + "Reduction in error using cleanlab opensource over baseline: 36.7%\n", + "Reduction in error using manual correction over opensource: 51.6%\n", + "Reduction in error using manual correction over baseline: 69.4%\n" + ] } - ] -} \ No newline at end of file + ], + "source": [ + "# Get the manually corrected data and split to match training subset.\n", + "clean_df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo-studio-export.csv\")\n", + "train_students = df_train.stud_ID.values\n", + "clean_df = clean_df[clean_df['stud_ID'].isin(train_students)]\n", + "\n", + "# Same pre-processing as above.\n", + "clean_df['cleanlab_suggested_label'] = preprocessing.LabelEncoder().fit_transform(clean_df['cleanlab_suggested_label'])\n", + "clean_df['notes'] = preprocessing.LabelEncoder().fit_transform(clean_df[\"notes\"])\n", + "clean_df['notes'] = clean_df['notes'].astype('category')\n", + "\n", + "# Train a more robust classifier with less erroneous data.\n", + "clean_labels = clean_df['cleanlab_suggested_label']\n", + "clean_data = clean_df[['exam_1','exam_2','exam_3','notes']]\n", + "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n", + "model.fit(clean_data, clean_labels)\n", + "\n", + "# Evaluate model on test split with ground truth labels.\n", + "preds = model.predict(test_data)\n", + "acc_manual = accuracy_score(preds, test_labels)\n", + "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n", + "print(f\"Accuracy with errors found by cleanlab removed: 
{round(acc_clean*100, 1)}%\")\n", + "print(f\"Accuracy with errors manually fixed: {round(acc_manual*100, 1)}%\")\n", + "print()\n", + "\n", + "# Compute reductions in error.\n", + "clos_err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n", + "manual_err = ((1-acc_clean)-(1-acc_manual))/(1-acc_clean)\n", + "tot_err = ((1-acc_original)-(1-acc_manual))/(1-acc_original)\n", + "print(f\"Reduction in error using cleanlab opensource over baseline: {round(clos_err*100,1)}%\")\n", + "print(f\"Reduction in error using manual correction over opensource: {round(manual_err*100,1)}%\")\n", + "print(f\"Reduction in error using manual correction over baseline: {round(tot_err*100,1)}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-W-Lo82SVp7I" + }, + "source": [ + "# Conclusion\n", + "\n", + "For the student grades dataset, we found that **simply dropping identified label errors and retraining the model resulted in a 37% reduction in prediction error** on our classification problem (with accuracy improving from 79% to 87%). 
\n", + "\n", + "Going one step further, we manually fixed the incorrect labels, **resulting in a 69% reduction in prediction error** (with accuracy improving from 79% to 94%).\n", + "\n", + "By using open-source libraries for data-centric AI like [cleanlab](https://github.com/cleanlab/cleanlab) to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n", + "\n", + "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}