From 12b20756fbed4b7b40263e2e8b61175c178f9351 Mon Sep 17 00:00:00 2001
From: Chris Mauck <cmauck10@mit.edu>
Date: Fri, 6 Jan 2023 10:59:25 -0600
Subject: [PATCH 1/3] Handling Mislabeled Tabular Data

---
 .DS_Store                                     | Bin 0 -> 6148 bytes
 ...r_Data_to_Improve_Your_XGBoost_Model.ipynb | 657 ++++++++++++++++++
 2 files changed, 657 insertions(+)
 create mode 100644 .DS_Store
 create mode 100644 handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb

diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..bc6b0f6102af0c717b43e4e208fceffe4f335b90
GIT binary patch
literal 6148
zcmeHK!Aj&n5Un0Z%^<=Y6!+NRRoU^NC|<&-Kd@Q!pv#z$puw1yNya%01G(xC`6qtQ
zzUuA~W!96h3{nNvud2GLJ9!<t(?q2DqjZm`OGE*bvDHEI58-~+IcfO_qq1WZlv7DJ
zv`hDM*%DYp26*i>-BOHoOX>ak<y66sDWMFQp?8;R>yK=F2tS37STe32lENey>m_+8
z@~ZCjUR7(Uy}Ys-tOjerpXk<%qdKlfRX-kHviIAVBAMB7a-L`7L3iWCly#h!+0Yi`
zX@-=mi@Z$DxNk;fTG`sjbp)NDGw5zjCWl7{JGwXB(azfI9qmu2onUkO=kc$>U2$KU
zC&#_QXKH2L;|!i*e3^#FXjm4eyunzrd1m7a3IoD`FtAVz_)RKUTd1d!hY<#Zfkj||
z_XiDS3_Ugu?bd;Ye;>KO&By|q?-GpKW9YGQh#rV?r9fAz{1rpFa>TXI3q3XtT{$T`
zGsf{VD}O^#c6P+IO(zvPlvWrJ20k*d<lHWw|7V}y|34OqCkzM!---d%K8;TMcqM<f
y&b=I;wHkU2W#PQW;adp|dlVy=kK!LtD~M}0fT736AtDg@5wJ8!BMkgk2L1vrJzfa_

literal 0
HcmV?d00001

diff --git a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
new file mode 100644
index 0000000..3b74fe3
--- /dev/null
+++ b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
@@ -0,0 +1,657 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n",
+        "\n",
+        "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n",
+        "\n",
+        "At a high level we will:\n",
+        "- Establish a baseline XGBoost model accuracy on the original data.\n",
+        "- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points. \n",
+        "- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. This simple step reduces the error in model predictions by **70%**! The raw accuracy gap between the two XGBoost models is roughly **23 percentage points** (67.4% versus 90.1%)."
+      ],
+      "metadata": {
+        "id": "zwfWqPeA1zX0"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Setup and Data Processing\n",
+        "\n",
+        "Let's take a look at our student grades tabular dataset. The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.\n",
+        "\n",
+        "However, for this demonstration, we have access to the true letter grade each student should have received, which we use to evaluate both the underlying accuracy of model predictions and how well cleanlab detects mislabeled data. These true grades are reserved for evaluation only; they are not part of the data used to train the model.\n",
+        "\n",
+        "In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation."
+      ],
+      "metadata": {
+        "id": "je7P55z4RwX_"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        },
+        "id": "nQVmMBQOS43j",
+        "outputId": "54a45659-ecb6-47fe-acfa-d0424027af47"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "  stud_ID  exam_1  exam_2  exam_3 notes  letter_grade  noisy_letter_grade\n",
+              "0  f48f73      53      77      93     5             2                   2\n",
+              "1  0bd4e7      81      64      80     2             1                   1\n",
+              "2  e1795d      74      88      97     5             1                   1\n",
+              "3  cb9d7a      61      94      78     5             2                   2\n",
+              "4  9acca4      48      90      91     5             2                   2"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>stud_ID</th>\n",
+              "      <th>exam_1</th>\n",
+              "      <th>exam_2</th>\n",
+              "      <th>exam_3</th>\n",
+              "      <th>notes</th>\n",
+              "      <th>letter_grade</th>\n",
+              "      <th>noisy_letter_grade</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>f48f73</td>\n",
+              "      <td>53</td>\n",
+              "      <td>77</td>\n",
+              "      <td>93</td>\n",
+              "      <td>5</td>\n",
+              "      <td>2</td>\n",
+              "      <td>2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>0bd4e7</td>\n",
+              "      <td>81</td>\n",
+              "      <td>64</td>\n",
+              "      <td>80</td>\n",
+              "      <td>2</td>\n",
+              "      <td>1</td>\n",
+              "      <td>1</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>e1795d</td>\n",
+              "      <td>74</td>\n",
+              "      <td>88</td>\n",
+              "      <td>97</td>\n",
+              "      <td>5</td>\n",
+              "      <td>1</td>\n",
+              "      <td>1</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>cb9d7a</td>\n",
+              "      <td>61</td>\n",
+              "      <td>94</td>\n",
+              "      <td>78</td>\n",
+              "      <td>5</td>\n",
+              "      <td>2</td>\n",
+              "      <td>2</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>9acca4</td>\n",
+              "      <td>48</td>\n",
+              "      <td>90</td>\n",
+              "      <td>91</td>\n",
+              "      <td>5</td>\n",
+              "      <td>2</td>\n",
+              "      <td>2</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 4
+        }
+      ],
+      "source": [
+        "!pip install cleanlab==2.2\n",
+        "!pip install xgboost==1.7\n",
+        "\n",
+        "from cleanlab.filter import find_label_issues\n",
+        "from xgboost import XGBClassifier\n",
+        "from sklearn import preprocessing\n",
+        "from sklearn.model_selection import cross_val_predict\n",
+        "from sklearn.metrics import accuracy_score\n",
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "\n",
+        "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n",
+        "df_c = df.copy()\n",
+        "\n",
+        "# Transform letter grades and notes to categorical numbers.\n",
+        "# Necessary for XGBoost and cleanlab.\n",
+        "df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n",
+        "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n",
+        "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n",
+        "df['notes'] = df['notes'].astype('category')\n",
+        "df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Get What We Need\n",
+        "\n",
+        "We need **out-of-sample** predicted probabilities for every data point as input to `find_label_issues()`. To obtain them, we will use XGBoost, which is commonly used with tabular data. Specifically, we combine an `XGBClassifier` model with cross-validation, which is easy to implement using the `cross_val_predict` function from scikit-learn.\n",
+        "\n",
+        "If our tabular data consisted solely of numerical and boolean values, we could use a simpler model such as nearest neighbors or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. Fortunately, XGBoost (>v1.6) can handle mixed data types (numerical and categorical) when the `enable_categorical` parameter is set to `True`, which simplifies the modeling process."
+      ],
+      "metadata": {
+        "id": "dVH_iciASD9F"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Train model on noisy labels.\n",
+        "# Drop the ID and label columns so that only the features remain.\n",
+        "data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
+        "labels = df['noisy_letter_grade']\n",
+        "\n",
+        "# XGBoost supports categorical data (experimental).\n",
+        "# Here we use default hyperparameters for simplicity.\n",
+        "# Get out-of-sample predicted probabilities and check model accuracy.\n",
+        "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+        "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n",
+        "preds = np.argmax(pred_probs, axis=1)\n",
+        "\n",
+        "acc_original = accuracy_score(labels, preds)\n",
+        "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")"
+      ],
+      "metadata": {
+        "id": "gCS19IqJsQUL",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Accuracy with original data: 67.4%\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Using the default hyperparameters, our cross-validated XGBoost model achieves an accuracy of 67.4% when predicting the noisy labels. This level of performance on such a basic task is unsatisfactory. It appears that the 20% label noise is significantly disrupting the model's ability to predict the labels accurately."
+      ],
+      "metadata": {
+        "id": "lTQ_iB-JSWUl"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Find Label Issues\n",
+        "\n",
+        "In just one line of code we get a list of possible label issues, ranked by self-confidence. Five of the flagged examples are shown below.\n",
+        "\n",
+        "Let's take a look at a few of the errors cleanlab has found. In the second row shown (index 159), the student cheated on exam 1 and got grades of 0, 96, and 90, which should result in a 'D' yet was accidentally labeled as a 'B'. In the fifth row shown (index 885), the student missed homework, resulting in a deduction of 10 points from the overall average. Exam grades of 97, 86, and 68 average to about 83, or about 73 after the deduction, which should result in a 'C' yet was accidentally labeled as an 'A'."
+      ],
+      "metadata": {
+        "id": "klDe2ag8SZ2T"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Returns list of indices of label issues, sorted by self_confidence.\n",
+        "issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')\n",
+        "# Filter original data to show some issues.\n",
+        "issues_df = df_c.iloc[issue_idx]\n",
+        "# Show a few good examples.\n",
+        "issues_df.iloc[13:18]"
+      ],
+      "metadata": {
+        "id": "SfJ83uP-Xski",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        },
+        "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "    stud_ID  exam_1  exam_2  exam_3                           notes  \\\n",
+              "23   5eef2c      90      83      51                             NaN   \n",
+              "159  b3a1a5       0      96      90      cheated on exam, gets 0pts   \n",
+              "301  4591b4      66      72      83  missed homework frequently -10   \n",
+              "71   38a6ec      88      67      74                             NaN   \n",
+              "885  f00c02      97      86      68  missed homework frequently -10   \n",
+              "\n",
+              "    letter_grade noisy_letter_grade  \n",
+              "23             C                  A  \n",
+              "159            D                  B  \n",
+              "301            D                  B  \n",
+              "71             C                  A  \n",
+              "885            C                  A  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-22ccd311-e295-49e0-af21-05e2fe7ec020\">\n",
+              "    <div class=\"colab-df-container\">\n",
+              "      <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>stud_ID</th>\n",
+              "      <th>exam_1</th>\n",
+              "      <th>exam_2</th>\n",
+              "      <th>exam_3</th>\n",
+              "      <th>notes</th>\n",
+              "      <th>letter_grade</th>\n",
+              "      <th>noisy_letter_grade</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>23</th>\n",
+              "      <td>5eef2c</td>\n",
+              "      <td>90</td>\n",
+              "      <td>83</td>\n",
+              "      <td>51</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>C</td>\n",
+              "      <td>A</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>159</th>\n",
+              "      <td>b3a1a5</td>\n",
+              "      <td>0</td>\n",
+              "      <td>96</td>\n",
+              "      <td>90</td>\n",
+              "      <td>cheated on exam, gets 0pts</td>\n",
+              "      <td>D</td>\n",
+              "      <td>B</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>301</th>\n",
+              "      <td>4591b4</td>\n",
+              "      <td>66</td>\n",
+              "      <td>72</td>\n",
+              "      <td>83</td>\n",
+              "      <td>missed homework frequently -10</td>\n",
+              "      <td>D</td>\n",
+              "      <td>B</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>71</th>\n",
+              "      <td>38a6ec</td>\n",
+              "      <td>88</td>\n",
+              "      <td>67</td>\n",
+              "      <td>74</td>\n",
+              "      <td>NaN</td>\n",
+              "      <td>C</td>\n",
+              "      <td>A</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>885</th>\n",
+              "      <td>f00c02</td>\n",
+              "      <td>97</td>\n",
+              "      <td>86</td>\n",
+              "      <td>68</td>\n",
+              "      <td>missed homework frequently -10</td>\n",
+              "      <td>C</td>\n",
+              "      <td>A</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-22ccd311-e295-49e0-af21-05e2fe7ec020')\"\n",
+              "              title=\"Convert this dataframe to an interactive table.\"\n",
+              "              style=\"display:none;\">\n",
+              "        \n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "       width=\"24px\">\n",
+              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
+              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
+              "  </svg>\n",
+              "      </button>\n",
+              "      \n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      flex-wrap:wrap;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "      <script>\n",
+              "        const buttonEl =\n",
+              "          document.querySelector('#df-22ccd311-e295-49e0-af21-05e2fe7ec020 button.colab-df-convert');\n",
+              "        buttonEl.style.display =\n",
+              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "        async function convertToInteractive(key) {\n",
+              "          const element = document.querySelector('#df-22ccd311-e295-49e0-af21-05e2fe7ec020');\n",
+              "          const dataTable =\n",
+              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                     [key], {});\n",
+              "          if (!dataTable) return;\n",
+              "\n",
+              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "            + ' to learn more about interactive tables.';\n",
+              "          element.innerHTML = '';\n",
+              "          dataTable['output_type'] = 'display_data';\n",
+              "          await google.colab.output.renderOutput(dataTable, element);\n",
+              "          const docLink = document.createElement('div');\n",
+              "          docLink.innerHTML = docLinkHtml;\n",
+              "          element.appendChild(docLink);\n",
+              "        }\n",
+              "      </script>\n",
+              "    </div>\n",
+              "  </div>\n",
+              "  "
+            ]
+          },
+          "metadata": {},
+          "execution_count": 6
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## How'd We Do?\n",
+        "\n",
+        "Let's go a step further and see how well cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by cleanlab and the true label errors, we see that cleanlab found about 80% of the true label errors (based on predictions from a model that is only 67% accurate)."
+      ],
+      "metadata": {
+        "id": "PrvJHkPzSq6Q"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Compute the percentage of true label errors that cleanlab identified.\n",
+        "true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values\n",
+        "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n",
+        "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")"
+      ],
+      "metadata": {
+        "id": "9O2a6urWc1DA",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Percentage of errors found: 79.8%\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Train a More Robust Model\n",
+        "\n",
+        "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n",
+        "\n",
+        "Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which achieved a cross-validation accuracy of 67%.\n",
+        "\n",
+        "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`."
+      ],
+      "metadata": {
+        "id": "YzxXoDOqSzn-"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Remove the label errors found by cleanlab.\n",
+        "data = df.drop(issue_idx)\n",
+        "labels = data['noisy_letter_grade']\n",
+        "data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
+        "\n",
+        "# Train a more robust classifier with less erroneous data.\n",
+        "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+        "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n",
+        "preds = np.argmax(pred_probs, axis=1)\n",
+        "\n",
+        "acc_clean = accuracy_score(labels, preds)\n",
+        "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n",
+        "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n",
+        "\n",
+        "# Compute reduction in error.\n",
+        "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n",
+        "print(f\"Reduction in error: {round(err*100,1)}%\")"
+      ],
+      "metadata": {
+        "id": "FsQFmy7xgSUa",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Accuracy with original data: 67.4%\n",
+            "Accuracy with errors found by cleanlab removed: 90.1%\n",
+            "Reduction in error: 69.7%\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "After removing the suspected label issues, our model's cross-validation accuracy is now 90%, which means we **reduced the error rate of the model by 70%** (the original model had 67% accuracy).\n",
+        "\n",
+        "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! This improvement comes strictly from increasing the quality of our data, which leaves room for further optimizations on the modeling side.**"
+      ],
+      "metadata": {
+        "id": "9J9clVf1UzQZ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Conclusion\n",
+        "\n",
+        "For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error on our classification problem (with accuracy improving from 67% to 90%). By using cleanlab to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n",
+        "\n",
+        "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)"
+      ],
+      "metadata": {
+        "id": "-W-Lo82SVp7I"
+      }
+    }
+  ]
+}
\ No newline at end of file

From f19e1dbe7111e51ce4f8b2f78dd8b46a1ac845b5 Mon Sep 17 00:00:00 2001
From: Chris Mauck <38672284+cmauck10@users.noreply.github.com>
Date: Wed, 11 Jan 2023 14:21:26 -0600
Subject: [PATCH 2/3] Add colab link

---
 ..._Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
index 3b74fe3..5944945 100644
--- a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
+++ b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
@@ -19,6 +19,8 @@
       "source": [
         "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n",
         "\n",
+        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n",
+        "\n",
         "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n",
         "\n",
         "At a high level we will:\n",

From 83d1ac79e05a266d7c46c1b82c1030faee12499e Mon Sep 17 00:00:00 2001
From: Chris Mauck <38672284+cmauck10@users.noreply.github.com>
Date: Tue, 7 Feb 2023 22:04:06 -0600
Subject: [PATCH 3/3] Update methodology and copy.

---
 ...r_Data_to_Improve_Your_XGBoost_Model.ipynb | 1141 ++++++++---------
 1 file changed, 500 insertions(+), 641 deletions(-)

diff --git a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
index 5944945..32529fe 100644
--- a/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
+++ b/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb
@@ -1,659 +1,518 @@
 {
-  "nbformat": 4,
-  "nbformat_minor": 0,
-  "metadata": {
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "zwfWqPeA1zX0"
+   },
+   "source": [
+    "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n",
+    "\n",
+    "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n",
+    "\n",
+    "At a high level we will:\n",
+    "- Establish the baseline accuracy of an XGBoost model on the original data.\n",
+    "- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points.\n",
+    "- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. This simple step **reduces the error in model predictions by 36%!** The raw difference in accuracy values between the two XGBoost models is **8%**.\n",
+    "- Manually correct the labels of all examples flagged by `find_label_issues()`, which **reduces the error in model predictions by 70%** relative to the baseline while using the identical XGBoost model!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "je7P55z4RwX_"
+   },
+   "source": [
+    "## Setup and Data Processing\n",
+    "\n",
+    "Let's take a look at our student grades tabular dataset. The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.\n",
+    "\n",
+    "However, for this demonstration, we have access to the true letter grade each student should've received, which we use for evaluating both the underlying accuracy of model predictions and how well cleanlab detects which data are mislabeled. These true grades are reserved for evaluation only; they are not present in the dataset used for ML.\n",
+    "\n",
+    "In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
     "colab": {
-      "provenance": []
+     "base_uri": "https://localhost:8080/",
+     "height": 206
     },
-    "kernelspec": {
-      "name": "python3",
-      "display_name": "Python 3"
-    },
-    "language_info": {
-      "name": "python"
-    }
+    "id": "nQVmMBQOS43j",
+    "outputId": "54a45659-ecb6-47fe-acfa-d0424027af47"
+   },
+   "outputs": [],
+   "source": [
+    "# !pip install cleanlab==2.2\n",
+    "# !pip install xgboost==1.7\n",
+    "\n",
+    "from cleanlab.filter import find_label_issues\n",
+    "from xgboost import XGBClassifier\n",
+    "from sklearn import preprocessing\n",
+    "from sklearn.model_selection import cross_val_predict\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.metrics import accuracy_score\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n",
+    "df_c = df.copy()\n",
+    "\n",
+    "# Transform letter grades and notes to categorical numbers.\n",
+    "# Necessary for XGBoost and cleanlab.\n",
+    "df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n",
+    "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n",
+    "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n",
+    "df['notes'] = df['notes'].astype('category')\n",
+    "\n",
+    "# Split data for evaluation and set test data.\n",
+    "df_train, df_test = train_test_split(df, random_state=0)\n",
+    "df_train.reset_index(drop=True, inplace=True)\n",
+    "df_test.reset_index(drop=True, inplace=True)\n",
+    "test_data = df_test.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
+    "test_labels = df_test['letter_grade']"
+   ]
   },
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "source": [
-        "# Handling Mislabeled Tabular Data to Improve Your XGBoost Model\n",
-        "\n",
-        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cmauck10/towardsai-tutorials/blob/master/handling-mislabeled-tabular-data/Handling_Mislabeled_Tabular_Data_to_Improve_Your_XGBoost_Model.ipynb)\n",
-        "\n",
-        "This notebook highlights data-centric AI techniques (using [cleanlab](https://github.com/cleanlab/cleanlab)) to improve the accuracy of an XGBoost classifier (reducing prediction errors by 70% on the noisy dataset considered here!). These techniques involve optimizing the dataset itself rather than altering the model's architecture or hyperparameters. As a result, it is possible to achieve further improvements in accuracy by fine-tuning the model in conjunction with the newly enhanced data. Additionally, the enhancements made to the dataset through these methods are transferable to other modeling and analytical endeavors, as opposed to being specific to a particular type of model.\n",
-        "\n",
-        "At a high level we will:\n",
-        "- Establish a baseline XGBoost model accuracy on the original data.\n",
-        "- Use cleanlab's `find_label_issues()` to highlight hundreds of mislabeled data points. \n",
-        "- Remove the data with automatically-flagged label issues from the dataset, and then retrain the exact same XGBoost model. This simple step reduces the error in model predictions by **70%!** The raw difference in accuracy values between the two XGBoost models is a whopping **23%**."
-      ],
-      "metadata": {
-        "id": "zwfWqPeA1zX0"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Setup and Data Processing\n",
-        "\n",
-        "Let's take a look at our student grades tabular dataset. The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features, but 20% of the grade labels in this dataset are actually incorrect.\n",
-        "\n",
-        "However, for this demonstration, we have access to the true letter grade each student should've received, which we use for evaluating both the underlying accuracy of model predictions and how well cleanlab detects which data are mislabeled. These true grades are only reserved for evaluation, they are not present in the dataset used for ML.\n",
-        "\n",
-        "In your noisily-labeled datasets, there will typically be no such ground truth, and therefore addressing label issues is even more important to facilitate proper model evaluation."
-      ],
-      "metadata": {
-        "id": "je7P55z4RwX_"
-      }
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 206
-        },
-        "id": "nQVmMBQOS43j",
-        "outputId": "54a45659-ecb6-47fe-acfa-d0424027af47"
-      },
-      "outputs": [
-        {
-          "output_type": "execute_result",
-          "data": {
-            "text/plain": [
-              "  stud_ID  exam_1  exam_2  exam_3 notes  letter_grade  noisy_letter_grade\n",
-              "0  f48f73      53      77      93     5             2                   2\n",
-              "1  0bd4e7      81      64      80     2             1                   1\n",
-              "2  e1795d      74      88      97     5             1                   1\n",
-              "3  cb9d7a      61      94      78     5             2                   2\n",
-              "4  9acca4      48      90      91     5             2                   2"
-            ],
-            "text/html": [
-              "\n",
-              "  <div id=\"df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc\">\n",
-              "    <div class=\"colab-df-container\">\n",
-              "      <div>\n",
-              "<style scoped>\n",
-              "    .dataframe tbody tr th:only-of-type {\n",
-              "        vertical-align: middle;\n",
-              "    }\n",
-              "\n",
-              "    .dataframe tbody tr th {\n",
-              "        vertical-align: top;\n",
-              "    }\n",
-              "\n",
-              "    .dataframe thead th {\n",
-              "        text-align: right;\n",
-              "    }\n",
-              "</style>\n",
-              "<table border=\"1\" class=\"dataframe\">\n",
-              "  <thead>\n",
-              "    <tr style=\"text-align: right;\">\n",
-              "      <th></th>\n",
-              "      <th>stud_ID</th>\n",
-              "      <th>exam_1</th>\n",
-              "      <th>exam_2</th>\n",
-              "      <th>exam_3</th>\n",
-              "      <th>notes</th>\n",
-              "      <th>letter_grade</th>\n",
-              "      <th>noisy_letter_grade</th>\n",
-              "    </tr>\n",
-              "  </thead>\n",
-              "  <tbody>\n",
-              "    <tr>\n",
-              "      <th>0</th>\n",
-              "      <td>f48f73</td>\n",
-              "      <td>53</td>\n",
-              "      <td>77</td>\n",
-              "      <td>93</td>\n",
-              "      <td>5</td>\n",
-              "      <td>2</td>\n",
-              "      <td>2</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>1</th>\n",
-              "      <td>0bd4e7</td>\n",
-              "      <td>81</td>\n",
-              "      <td>64</td>\n",
-              "      <td>80</td>\n",
-              "      <td>2</td>\n",
-              "      <td>1</td>\n",
-              "      <td>1</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>2</th>\n",
-              "      <td>e1795d</td>\n",
-              "      <td>74</td>\n",
-              "      <td>88</td>\n",
-              "      <td>97</td>\n",
-              "      <td>5</td>\n",
-              "      <td>1</td>\n",
-              "      <td>1</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>3</th>\n",
-              "      <td>cb9d7a</td>\n",
-              "      <td>61</td>\n",
-              "      <td>94</td>\n",
-              "      <td>78</td>\n",
-              "      <td>5</td>\n",
-              "      <td>2</td>\n",
-              "      <td>2</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>4</th>\n",
-              "      <td>9acca4</td>\n",
-              "      <td>48</td>\n",
-              "      <td>90</td>\n",
-              "      <td>91</td>\n",
-              "      <td>5</td>\n",
-              "      <td>2</td>\n",
-              "      <td>2</td>\n",
-              "    </tr>\n",
-              "  </tbody>\n",
-              "</table>\n",
-              "</div>\n",
-              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc')\"\n",
-              "              title=\"Convert this dataframe to an interactive table.\"\n",
-              "              style=\"display:none;\">\n",
-              "        \n",
-              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
-              "       width=\"24px\">\n",
-              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
-              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
-              "  </svg>\n",
-              "      </button>\n",
-              "      \n",
-              "  <style>\n",
-              "    .colab-df-container {\n",
-              "      display:flex;\n",
-              "      flex-wrap:wrap;\n",
-              "      gap: 12px;\n",
-              "    }\n",
-              "\n",
-              "    .colab-df-convert {\n",
-              "      background-color: #E8F0FE;\n",
-              "      border: none;\n",
-              "      border-radius: 50%;\n",
-              "      cursor: pointer;\n",
-              "      display: none;\n",
-              "      fill: #1967D2;\n",
-              "      height: 32px;\n",
-              "      padding: 0 0 0 0;\n",
-              "      width: 32px;\n",
-              "    }\n",
-              "\n",
-              "    .colab-df-convert:hover {\n",
-              "      background-color: #E2EBFA;\n",
-              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
-              "      fill: #174EA6;\n",
-              "    }\n",
-              "\n",
-              "    [theme=dark] .colab-df-convert {\n",
-              "      background-color: #3B4455;\n",
-              "      fill: #D2E3FC;\n",
-              "    }\n",
-              "\n",
-              "    [theme=dark] .colab-df-convert:hover {\n",
-              "      background-color: #434B5C;\n",
-              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
-              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
-              "      fill: #FFFFFF;\n",
-              "    }\n",
-              "  </style>\n",
-              "\n",
-              "      <script>\n",
-              "        const buttonEl =\n",
-              "          document.querySelector('#df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc button.colab-df-convert');\n",
-              "        buttonEl.style.display =\n",
-              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
-              "\n",
-              "        async function convertToInteractive(key) {\n",
-              "          const element = document.querySelector('#df-8d5727bd-e413-4c6b-8e5f-afc76beac2cc');\n",
-              "          const dataTable =\n",
-              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
-              "                                                     [key], {});\n",
-              "          if (!dataTable) return;\n",
-              "\n",
-              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
-              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
-              "            + ' to learn more about interactive tables.';\n",
-              "          element.innerHTML = '';\n",
-              "          dataTable['output_type'] = 'display_data';\n",
-              "          await google.colab.output.renderOutput(dataTable, element);\n",
-              "          const docLink = document.createElement('div');\n",
-              "          docLink.innerHTML = docLinkHtml;\n",
-              "          element.appendChild(docLink);\n",
-              "        }\n",
-              "      </script>\n",
-              "    </div>\n",
-              "  </div>\n",
-              "  "
-            ]
-          },
-          "metadata": {},
-          "execution_count": 4
-        }
-      ],
-      "source": [
-        "!pip install cleanlab==2.2\n",
-        "!pip install xgboost==1.7\n",
-        "\n",
-        "from cleanlab.filter import find_label_issues\n",
-        "from xgboost import XGBClassifier\n",
-        "from sklearn import preprocessing\n",
-        "from sklearn.model_selection import cross_val_predict\n",
-        "from sklearn.metrics import accuracy_score\n",
-        "import pandas as pd\n",
-        "import numpy as np\n",
-        "\n",
-        "df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo.csv\")\n",
-        "df_c = df.copy()\n",
-        "\n",
-        "# Transform letter grades and notes to categorical numbers.\n",
-        "# Necessary for XGBoost and cleanlab.\n",
-        "df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])\n",
-        "df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])\n",
-        "df['notes'] = preprocessing.LabelEncoder().fit_transform(df[\"notes\"])\n",
-        "df['notes'] = df['notes'].astype('category')\n",
-        "df.head()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "# Get What We Need\n",
-        "\n",
-        "We need to obtain **out-of-sample** predicted probabilities for all of our data in order to provide the `find_label_issues()` method with the necessary input. To do this, we will use XGBoost which is commonly used with tabular data. Specifically, getting the predicted probabilities can be achieved through the use of a `XGBClassifier` model in conjunction with cross-validation, which can be implemented easily using the `cross_val_predict` function from scikit-learn.\n",
-        "\n",
-        "If our tabular data consisted solely of numerical and boolean values, we could potentially utilize a simpler model such as a nearest-neighbor or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. Fortunately, XGBoost (>v1.6) is able to handle mixed data types (numerical and categorical) by setting the `enable_categorical` parameter to `true`, thereby simplifying the modeling process.\""
-      ],
-      "metadata": {
-        "id": "dVH_iciASD9F"
-      }
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "# Train model on noisy labels.\n",
-        "# Convert numerical notes label encoding to categorical.\n",
-        "data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
-        "labels = df['noisy_letter_grade']\n",
-        "\n",
-        "# XGBoost(experimental) supports categorical data.\n",
-        "# Here we use default hyperparameters for simplicity.\n",
-        "# Get out-of-sample predicted probabilities and check model accuracy.\n",
-        "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
-        "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n",
-        "preds = np.argmax(pred_probs, axis=1)\n",
-        "\n",
-        "acc_original = accuracy_score(preds, labels)\n",
-        "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")"
-      ],
-      "metadata": {
-        "id": "gCS19IqJsQUL",
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b"
-      },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Accuracy with original data: 67.4%\n"
-          ]
-        }
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "Using the default hyperparameters, our cross-validated XGBoost model demonstrates an accuracy of 67.3% when predicting the noisy labels. This level of performance on such a basic task is unsatisfactory. It appears that the presence of 20% label noise is significantly disrupting the model's ability to accurately predict the labels."
-      ],
-      "metadata": {
-        "id": "lTQ_iB-JSWUl"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "# Find Label Issues\n",
-        "\n",
-        "In just one line of code we get a list of possible label issues - it really is that easy! Top 5 results shown below.\n",
-        "\n",
-        "Let's take a look at a few of the errors cleanlab has found. Take a look at row 2, where the student cheated on exam 1 and got grades of 0, 96, and 90 which should result in a 'D' yet was accidentally labeled as a 'B'. In row 5, the student missed homework resulting in a deduction of 10 points from the overall average, receiving exam grades of 97, 86, and 68 (averages to 83, overall 73 with the deduction) which should result in a 'C' yet was accidentally labeled as an 'A'. "
-      ],
-      "metadata": {
-        "id": "klDe2ag8SZ2T"
-      }
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "dVH_iciASD9F"
+   },
+   "source": [
+    "# Train and Evaluate XGBoost Classifier\n",
+    "\n",
+    "Now that we’ve seen what can be achieved with cleanlab, let’s take a look at how we get there.\n",
+    "\n",
+    "For our model of choice, we will use XGBoost, an implementation of gradient-boosted decision trees (GBDT), which are commonly used with tabular data. If our tabular data consisted solely of numerical and boolean values, we could potentially use a simpler model such as nearest neighbors or logistic regression. However, our data includes a notes column, which we will treat as a categorical feature. Fortunately, XGBoost (>v1.6) can handle mixed data types (numerical and categorical) when the `enable_categorical` parameter is set to `True`, which simplifies the modeling process."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
     },
+    "id": "gCS19IqJsQUL",
+    "outputId": "fd89b945-6793-4b3a-a017-7b19f8e6a29b"
+   },
+   "outputs": [
     {
-      "cell_type": "code",
-      "source": [
-        "# Returns list of indices of label issues, sorted by self_confidence.\n",
-        "issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')\n",
-        "# Filter original data to show some issues.\n",
-        "issues_df = df_c.iloc[issue_idx]\n",
-        "# Show a few good examples.\n",
-        "issues_df.iloc[13:18]"
-      ],
-      "metadata": {
-        "id": "SfJ83uP-Xski",
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 206
-        },
-        "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e"
-      },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "execute_result",
-          "data": {
-            "text/plain": [
-              "    stud_ID  exam_1  exam_2  exam_3                           notes  \\\n",
-              "23   5eef2c      90      83      51                             NaN   \n",
-              "159  b3a1a5       0      96      90      cheated on exam, gets 0pts   \n",
-              "301  4591b4      66      72      83  missed homework frequently -10   \n",
-              "71   38a6ec      88      67      74                             NaN   \n",
-              "885  f00c02      97      86      68  missed homework frequently -10   \n",
-              "\n",
-              "    letter_grade noisy_letter_grade  \n",
-              "23             C                  A  \n",
-              "159            D                  B  \n",
-              "301            D                  B  \n",
-              "71             C                  A  \n",
-              "885            C                  A  "
-            ],
-            "text/html": [
-              "\n",
-              "  <div id=\"df-22ccd311-e295-49e0-af21-05e2fe7ec020\">\n",
-              "    <div class=\"colab-df-container\">\n",
-              "      <div>\n",
-              "<style scoped>\n",
-              "    .dataframe tbody tr th:only-of-type {\n",
-              "        vertical-align: middle;\n",
-              "    }\n",
-              "\n",
-              "    .dataframe tbody tr th {\n",
-              "        vertical-align: top;\n",
-              "    }\n",
-              "\n",
-              "    .dataframe thead th {\n",
-              "        text-align: right;\n",
-              "    }\n",
-              "</style>\n",
-              "<table border=\"1\" class=\"dataframe\">\n",
-              "  <thead>\n",
-              "    <tr style=\"text-align: right;\">\n",
-              "      <th></th>\n",
-              "      <th>stud_ID</th>\n",
-              "      <th>exam_1</th>\n",
-              "      <th>exam_2</th>\n",
-              "      <th>exam_3</th>\n",
-              "      <th>notes</th>\n",
-              "      <th>letter_grade</th>\n",
-              "      <th>noisy_letter_grade</th>\n",
-              "    </tr>\n",
-              "  </thead>\n",
-              "  <tbody>\n",
-              "    <tr>\n",
-              "      <th>23</th>\n",
-              "      <td>5eef2c</td>\n",
-              "      <td>90</td>\n",
-              "      <td>83</td>\n",
-              "      <td>51</td>\n",
-              "      <td>NaN</td>\n",
-              "      <td>C</td>\n",
-              "      <td>A</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>159</th>\n",
-              "      <td>b3a1a5</td>\n",
-              "      <td>0</td>\n",
-              "      <td>96</td>\n",
-              "      <td>90</td>\n",
-              "      <td>cheated on exam, gets 0pts</td>\n",
-              "      <td>D</td>\n",
-              "      <td>B</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>301</th>\n",
-              "      <td>4591b4</td>\n",
-              "      <td>66</td>\n",
-              "      <td>72</td>\n",
-              "      <td>83</td>\n",
-              "      <td>missed homework frequently -10</td>\n",
-              "      <td>D</td>\n",
-              "      <td>B</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>71</th>\n",
-              "      <td>38a6ec</td>\n",
-              "      <td>88</td>\n",
-              "      <td>67</td>\n",
-              "      <td>74</td>\n",
-              "      <td>NaN</td>\n",
-              "      <td>C</td>\n",
-              "      <td>A</td>\n",
-              "    </tr>\n",
-              "    <tr>\n",
-              "      <th>885</th>\n",
-              "      <td>f00c02</td>\n",
-              "      <td>97</td>\n",
-              "      <td>86</td>\n",
-              "      <td>68</td>\n",
-              "      <td>missed homework frequently -10</td>\n",
-              "      <td>C</td>\n",
-              "      <td>A</td>\n",
-              "    </tr>\n",
-              "  </tbody>\n",
-              "</table>\n",
-              "</div>\n",
-              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-22ccd311-e295-49e0-af21-05e2fe7ec020')\"\n",
-              "              title=\"Convert this dataframe to an interactive table.\"\n",
-              "              style=\"display:none;\">\n",
-              "        \n",
-              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
-              "       width=\"24px\">\n",
-              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
-              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
-              "  </svg>\n",
-              "      </button>\n",
-              "      \n",
-              "  <style>\n",
-              "    .colab-df-container {\n",
-              "      display:flex;\n",
-              "      flex-wrap:wrap;\n",
-              "      gap: 12px;\n",
-              "    }\n",
-              "\n",
-              "    .colab-df-convert {\n",
-              "      background-color: #E8F0FE;\n",
-              "      border: none;\n",
-              "      border-radius: 50%;\n",
-              "      cursor: pointer;\n",
-              "      display: none;\n",
-              "      fill: #1967D2;\n",
-              "      height: 32px;\n",
-              "      padding: 0 0 0 0;\n",
-              "      width: 32px;\n",
-              "    }\n",
-              "\n",
-              "    .colab-df-convert:hover {\n",
-              "      background-color: #E2EBFA;\n",
-              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
-              "      fill: #174EA6;\n",
-              "    }\n",
-              "\n",
-              "    [theme=dark] .colab-df-convert {\n",
-              "      background-color: #3B4455;\n",
-              "      fill: #D2E3FC;\n",
-              "    }\n",
-              "\n",
-              "    [theme=dark] .colab-df-convert:hover {\n",
-              "      background-color: #434B5C;\n",
-              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
-              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
-              "      fill: #FFFFFF;\n",
-              "    }\n",
-              "  </style>\n",
-              "\n",
-              "      <script>\n",
-              "        const buttonEl =\n",
-              "          document.querySelector('#df-22ccd311-e295-49e0-af21-05e2fe7ec020 button.colab-df-convert');\n",
-              "        buttonEl.style.display =\n",
-              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
-              "\n",
-              "        async function convertToInteractive(key) {\n",
-              "          const element = document.querySelector('#df-22ccd311-e295-49e0-af21-05e2fe7ec020');\n",
-              "          const dataTable =\n",
-              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
-              "                                                     [key], {});\n",
-              "          if (!dataTable) return;\n",
-              "\n",
-              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
-              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
-              "            + ' to learn more about interactive tables.';\n",
-              "          element.innerHTML = '';\n",
-              "          dataTable['output_type'] = 'display_data';\n",
-              "          await google.colab.output.renderOutput(dataTable, element);\n",
-              "          const docLink = document.createElement('div');\n",
-              "          docLink.innerHTML = docLinkHtml;\n",
-              "          element.appendChild(docLink);\n",
-              "        }\n",
-              "      </script>\n",
-              "    </div>\n",
-              "  </div>\n",
-              "  "
-            ]
-          },
-          "metadata": {},
-          "execution_count": 6
-        }
-      ]
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Accuracy with original data: 79.2%\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Train model on noisy labels.\n",
+    "train_data = df_train.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
+    "train_labels = df_train['noisy_letter_grade']\n",
+    "\n",
+    "# XGBoost has experimental support for categorical data.\n",
+    "# Here we use default hyperparameters for simplicity.\n",
+    "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+    "model.fit(train_data, train_labels)\n",
+    "\n",
+    "# Evaluate model on test split with ground truth labels.\n",
+    "preds = model.predict(test_data)\n",
+    "acc_original = accuracy_score(preds, test_labels)\n",
+    "print(f\"Accuracy with original data: {round(acc_original*100,1)}%\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "lTQ_iB-JSWUl"
+   },
+   "source": [
+    "Using the default hyperparameters, our baseline XGBoost model reaches 79.2% accuracy on the test set when trained on the noisy labels. The 20% label noise appears to significantly degrade the model’s ability to predict labels on what should be a trivial task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "klDe2ag8SZ2T"
+   },
+   "source": [
+    "# Find Label Issues\n",
+    "\n",
+    "To use cleanlab, we need **out-of-sample** predicted probabilities for all of our training data, which are the input to the `find_label_issues()` method. We can obtain them by running our `XGBClassifier` model under cross-validation, using the `cross_val_predict` function from scikit-learn.\n",
+    "\n",
+    "In just a few lines of code, we get a list of possible label issues! A few of the top results are shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 206
     },
+    "id": "SfJ83uP-Xski",
+    "outputId": "95ada77a-9ff2-4505-bec4-1255ef1f171e"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "source": [
-        "# How'd We Do?\n",
-        "\n",
-        "Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the labels errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 80% of the label errors correctly (based on predictions from a model that is only 67% accurate). "
-      ],
-      "metadata": {
-        "id": "PrvJHkPzSq6Q"
-      }
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "2023-02-07 22:01:11.846308: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA\n",
+      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+      "/var/folders/0l/js52t50n2d71fm_0mhtlwx880000gn/T/ipykernel_37207/2155506160.py:11: SettingWithCopyWarning: \n",
+      "A value is trying to be set on a copy of a slice from a DataFrame\n",
+      "\n",
+      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
+      "  issues_df.sort_values(by=\"stud_ID\", key=lambda column: column.map(lambda e: issue_stud_id.index(e)), inplace=True)\n"
+     ]
     },
     {
-      "cell_type": "code",
-      "source": [
-        "# Computing percentage of true errors identified. \n",
-        "true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values\n",
-        "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n",
-        "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")"
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>stud_ID</th>\n",
+       "      <th>exam_1</th>\n",
+       "      <th>exam_2</th>\n",
+       "      <th>exam_3</th>\n",
+       "      <th>notes</th>\n",
+       "      <th>letter_grade</th>\n",
+       "      <th>noisy_letter_grade</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>765</th>\n",
+       "      <td>75ce98</td>\n",
+       "      <td>91</td>\n",
+       "      <td>89</td>\n",
+       "      <td>81</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>B</td>\n",
+       "      <td>F</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>744</th>\n",
+       "      <td>3d0fdf</td>\n",
+       "      <td>90</td>\n",
+       "      <td>74</td>\n",
+       "      <td>95</td>\n",
+       "      <td>great participation +10</td>\n",
+       "      <td>A</td>\n",
+       "      <td>F</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>637</th>\n",
+       "      <td>77c9c5</td>\n",
+       "      <td>0</td>\n",
+       "      <td>79</td>\n",
+       "      <td>65</td>\n",
+       "      <td>cheated on exam, gets 0pts</td>\n",
+       "      <td>F</td>\n",
+       "      <td>A</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>404</th>\n",
+       "      <td>bb13f4</td>\n",
+       "      <td>65</td>\n",
+       "      <td>95</td>\n",
+       "      <td>68</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>C</td>\n",
+       "      <td>A</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>217</th>\n",
+       "      <td>3c4cbb</td>\n",
+       "      <td>94</td>\n",
+       "      <td>62</td>\n",
+       "      <td>66</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>C</td>\n",
+       "      <td>C</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
       ],
-      "metadata": {
-        "id": "9O2a6urWc1DA",
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8"
-      },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Percentage of errors found: 79.8%\n"
-          ]
-        }
+      "text/plain": [
+       "    stud_ID  exam_1  exam_2  exam_3                       notes letter_grade  \\\n",
+       "765  75ce98      91      89      81                         NaN            B   \n",
+       "744  3d0fdf      90      74      95     great participation +10            A   \n",
+       "637  77c9c5       0      79      65  cheated on exam, gets 0pts            F   \n",
+       "404  bb13f4      65      95      68                         NaN            C   \n",
+       "217  3c4cbb      94      62      66                         NaN            C   \n",
+       "\n",
+       "    noisy_letter_grade  \n",
+       "765                  F  \n",
+       "744                  F  \n",
+       "637                  A  \n",
+       "404                  A  \n",
+       "217                  C  "
       ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Get predicted probabilities through cross validation.\n",
+    "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+    "pred_probs = cross_val_predict(model, train_data, train_labels, method='predict_proba')\n",
+    "\n",
+    "# Returns list of indices of label issues, sorted by self_confidence.\n",
+    "issue_idx = find_label_issues(train_labels, pred_probs, return_indices_ranked_by='self_confidence')\n",
+    "\n",
+    "# Filter original data to show students with grade issues.\n",
+    "issue_stud_id = df_train.iloc[issue_idx].stud_ID.values.tolist()\n",
+    "issues_df = df_c[df_c['stud_ID'].isin(issue_stud_id)]\n",
+    "issues_df.sort_values(by=\"stud_ID\", key=lambda column: column.map(lambda e: issue_stud_id.index(e)), inplace=True)\n",
+    "\n",
+    "# Show a few good examples.\n",
+    "issues_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let’s take a look at a few of the label issues automatically identified in our dataset. In row 1, the student scored 91, 89, and 81 (an average of 87), which should result in a ‘B’, yet the student was labeled ‘F’. In row 2, the student had great participation, which adds 10 points to the overall average; exam grades of 90, 74, and 95 average to 86.3, or 96.3 with the bonus, which should result in an ‘A’, yet the student was labeled ‘F’.\n",
+    "\n",
+    "**Note: `find_label_issues` is able to determine that the given label is incorrect, without ever seeing the ground truth label `letter_grade`.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "PrvJHkPzSq6Q"
+   },
+   "source": [
+    "# How'd We Do?\n",
+    "\n",
+    "Let's go a step further and see how cleanlab did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by cleanlab and the true label errors, we see that cleanlab was able to identify 83% of the label errors correctly (based on predictions from a model that is only 79% accurate). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
     },
+    "id": "9O2a6urWc1DA",
+    "outputId": "f88b5ce6-33f2-4ef6-e774-19013c33f0e8"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "source": [
-        "# Train a More Robust Model\n",
-        "\n",
-        "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n",
-        "\n",
-        "Keep in mind our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, which achieved a cross-validation accuracy of 67%.\n",
-        "\n",
-        "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`."
-      ],
-      "metadata": {
-        "id": "YzxXoDOqSzn-"
-      }
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "# Remove the label errors found by cleanlab.\n",
-        "data = df.drop(issue_idx)\n",
-        "labels = data['noisy_letter_grade']\n",
-        "data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
-        "\n",
-        "# Train a more robust classifier with less erroneous data.\n",
-        "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
-        "pred_probs = cross_val_predict(model, data, labels, method='predict_proba')\n",
-        "preds = np.argmax(pred_probs, axis=1)\n",
-        "\n",
-        "acc_clean = accuracy_score(preds, labels)\n",
-        "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n",
-        "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n",
-        "\n",
-        "# Compute reduction in error.\n",
-        "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n",
-        "print(f\"Reduction in error: {round(err*100,1)}%\")"
-      ],
-      "metadata": {
-        "id": "FsQFmy7xgSUa",
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313"
-      },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Accuracy with original data: 67.4%\n",
-            "Accuracy with errors found by cleanlab removed: 90.1%\n",
-            "Reduction in error: 69.7%\n"
-          ]
-        }
-      ]
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Percentage of errors found: 82.9%\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Computing percentage of true errors identified. \n",
+    "true_error_idx = df_train[df_train.letter_grade != df_train.noisy_letter_grade].index.values\n",
+    "cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)\n",
+    "print(f\"Percentage of errors found: {round(cl_acc*100,1)}%\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "YzxXoDOqSzn-"
+   },
+   "source": [
+    "# Retraining a More Robust Model\n",
+    "\n",
+    "Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.\n",
+    "\n",
+    "Keep in mind our baseline model from above, trained on the original data using `noisy_letter_grade` as the prediction label, which achieved an accuracy of 79% on the test split.\n",
+    "\n",
+    "Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
     },
+    "id": "FsQFmy7xgSUa",
+    "outputId": "a33e17ab-c197-4f95-c9c1-0473c16af313"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "source": [
-        "After removing the suspected label issues, our model's new cross-validation accuracy is now 90%, which means we **reduced the error-rate of the model by 70%** (the original model had 67% accuracy). \n",
-        "\n",
-        "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing!  This improvement is strictly coming from increasing the quality of our data which leaves additional room for additional optimizations on the modeling side.**"
-      ],
-      "metadata": {
-        "id": "9J9clVf1UzQZ"
-      }
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Accuracy with original data: 79.2%\n",
+      "Accuracy with errors found by cleanlab removed: 86.9%\n",
+      "Reduction in error: 36.7%\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Remove the label errors found by cleanlab.\n",
+    "train_data_cl = df_train.drop(issue_idx)\n",
+    "train_labels_cl = train_data_cl['noisy_letter_grade']\n",
+    "train_data_cl = train_data_cl.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)\n",
+    "\n",
+    "# Train a more robust classifier with less erroneous data.\n",
+    "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+    "model.fit(train_data_cl, train_labels_cl)\n",
+    "\n",
+    "# Evaluate model on test split with ground truth labels.\n",
+    "preds = model.predict(test_data)\n",
+    "acc_clean = accuracy_score(preds, test_labels)\n",
+    "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n",
+    "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n",
+    "\n",
+    "# Compute reduction in error.\n",
+    "err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n",
+    "print(f\"Reduction in error: {round(err*100,1)}%\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "9J9clVf1UzQZ"
+   },
+   "source": [
+    "After removing the suspected label issues, our model's accuracy on the test split is now 87%, which means we **reduced the error rate of the model by about 37%** (the original model had 79% accuracy, so the error rate fell from 20.8% to 13.1%). \n",
+    "\n",
+    "**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! This improvement comes strictly from increasing the quality of our data, which leaves room for additional optimizations on the modeling side.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Fixing Label Errors\n",
+    "\n",
+    "Instead of just dropping the potential label issues, the smarter (yet more time-intensive) way to increase our data quality is to correct the automatically identified label issues by hand. This simultaneously removes a noisy data point and adds an accurate one.\n",
+    "\n",
+    "I reviewed the potential label errors identified by `find_label_issues()` and made adjustments to the labels as needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "source": [
-        "# Conclusion\n",
-        "\n",
-        "For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error on our classification problem (with accuracy improving from 67% to 90%). By using cleanlab to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n",
-        "\n",
-        "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)"
-      ],
-      "metadata": {
-        "id": "-W-Lo82SVp7I"
-      }
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Accuracy with original data: 79.2%\n",
+      "Accuracy with errors found by cleanlab removed: 86.9%\n",
+      "Accuracy with errors manually fixed: 93.6%\n",
+      "\n",
+      "Reduction in error using cleanlab opensource over baseline: 36.7%\n",
+      "Reduction in error using manual correction over opensource: 51.6%\n",
+      "Reduction in error using manual correction over baseline: 69.4%\n"
+     ]
     }
-  ]
-}
\ No newline at end of file
+   ],
+   "source": [
+    "# Get the manually corrected data and split to match training subset.\n",
+    "clean_df = pd.read_csv(\"https://s.cleanlab.ai/student-grades-demo-studio-export.csv\")\n",
+    "train_students = df_train.stud_ID.values\n",
+    "clean_df = clean_df[clean_df['stud_ID'].isin(train_students)]\n",
+    "\n",
+    "# Same pre-processing as above.\n",
+    "clean_df['cleanlab_suggested_label'] = preprocessing.LabelEncoder().fit_transform(clean_df['cleanlab_suggested_label'])\n",
+    "clean_df['notes'] = preprocessing.LabelEncoder().fit_transform(clean_df[\"notes\"])\n",
+    "clean_df['notes'] = clean_df['notes'].astype('category')\n",
+    "\n",
+    "# Train a more robust classifier with less erroneous data.\n",
+    "clean_labels = clean_df['cleanlab_suggested_label']\n",
+    "clean_data = clean_df[['exam_1','exam_2','exam_3','notes']]\n",
+    "model = XGBClassifier(tree_method=\"hist\", enable_categorical=True)\n",
+    "model.fit(clean_data, clean_labels)\n",
+    "\n",
+    "# Evaluate model on test split with ground truth labels.\n",
+    "preds = model.predict(test_data)\n",
+    "acc_manual = accuracy_score(preds, test_labels)\n",
+    "print(f\"Accuracy with original data: {round(acc_original*100, 1)}%\")\n",
+    "print(f\"Accuracy with errors found by cleanlab removed: {round(acc_clean*100, 1)}%\")\n",
+    "print(f\"Accuracy with errors manually fixed: {round(acc_manual*100, 1)}%\")\n",
+    "print()\n",
+    "\n",
+    "# Compute reductions in error.\n",
+    "clos_err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)\n",
+    "manual_err = ((1-acc_clean)-(1-acc_manual))/(1-acc_clean)\n",
+    "tot_err = ((1-acc_original)-(1-acc_manual))/(1-acc_original)\n",
+    "print(f\"Reduction in error using cleanlab opensource over baseline: {round(clos_err*100,1)}%\")\n",
+    "print(f\"Reduction in error using manual correction over opensource: {round(manual_err*100,1)}%\")\n",
+    "print(f\"Reduction in error using manual correction over baseline: {round(tot_err*100,1)}%\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "-W-Lo82SVp7I"
+   },
+   "source": [
+    "# Conclusion\n",
+    "\n",
+    "For the student grades dataset, we found that **simply dropping identified label errors and retraining the model resulted in a 37% reduction in prediction error** on our classification problem (with test accuracy improving from 79% to 87%). \n",
+    "\n",
+    "Going one step further, we manually fixed the incorrect labels, **resulting in a 69% reduction in prediction error** relative to the baseline (with test accuracy improving from 79% to 94%).\n",
+    "\n",
+    "By using open-source libraries for data-centric AI like [cleanlab](https://github.com/cleanlab/cleanlab) to ensure the integrity of your data, you can mitigate costly labeling errors and boost the performance of your models.\n",
+    "\n",
+    "[Cleanlab GitHub](https://github.com/cleanlab/cleanlab)"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}