Commit cd17f07

Update 01 exercise with solution

1 parent 7906dc7 commit cd17f07

3 files changed: +724 -119 lines changed

exercises/01_penguin_classification.ipynb

Lines changed: 214 additions & 33 deletions
@@ -39,6 +39,90 @@
 "from palmerpenguins import load_penguins"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"<div style=\"text-align: center;\">\n",
+" <img src=\"https://raw.githubusercontent.com/allisonhorst/palmerpenguins/c19a904462482430170bfe2c718775ddb7dbb885/man/figures/culmen_depth.png\" width=\"500\" />\n",
+"</div>"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Task 1 -- Part (b): Use seaborn to plot the distribution of the penguin species in the dataset."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# import seaborn as sns\n",
+"# sns.pairplot(data.drop(\"year\", axis=1), hue='species')"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"vscode": {
+"languageId": "markdown"
+}
+},
+"source": [
+"### Task 1 -- Part (c): Apply umap to visualise the data"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import umap\n",
+"import matplotlib.pyplot as plt\n",
+"import seaborn as sns\n",
+"from sklearn.preprocessing import StandardScaler\n",
+"\n",
+"# Drop rows with missing values\n",
+"data = data.dropna()\n",
+"\n",
+"# Extract features\n",
+"penguin_data = data[\n",
+"    [\n",
+"        \"bill_length_mm\",\n",
+"        \"bill_depth_mm\",\n",
+"        \"flipper_length_mm\",\n",
+"        \"body_mass_g\",\n",
+"    ]\n",
+"].values\n",
+"scaled_penguin_data = StandardScaler().fit_transform(penguin_data)\n",
+"\n",
+"# Fit and transform\n",
+"reducer = umap.UMAP(random_state=42)\n",
+"embedding = reducer.fit_transform(scaled_penguin_data)\n",
+"\n",
+"colors = sns.color_palette()\n",
+"\n",
+"for i, (species, group) in enumerate(data.groupby(\"species\")):\n",
+"    plt.scatter(\n",
+"        embedding[data.species == species, 0],\n",
+"        embedding[data.species == species, 1],\n",
+"        label=species,\n",
+"        color=colors[i],\n",
+"    )\n",
+"\n",
+"plt.gca().set_aspect(\"equal\", \"datalim\")\n",
+"plt.title(\"UMAP projection of the Penguin dataset\", fontsize=24)\n",
+"plt.xlabel(\"UMAP 1\", fontsize=18)\n",
+"plt.ylabel(\"UMAP 2\", fontsize=18)\n",
+"plt.legend(loc=\"upper right\", fontsize=10, title=\"Species\")\n",
+"plt.show()"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -181,7 +265,7 @@
 "source": [
 "data_set = PenguinDataset(\n",
 "    input_keys=[\"bill_length_mm\", \"body_mass_g\"],\n",
-"    target_keys=...,\n",
+"    target_key=...,\n",
 "    train=True,\n",
 ")\n",
 "\n",
@@ -203,10 +287,7 @@
 " <li>We must represent these data as <code>torch.Tensor</code>s. This is the fundamental data abstraction used by PyTorch; they are the PyTorch equivalent to Numpy arrays, while also providing support for GPU acceleration. See <a href=\"https://pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html\">pytorch tensors documentation</a>.</li>\n",
 " <li>The targets are tuples of strings i.e. ('Gentoo', )\n",
 " <ul>\n",
-" <li>One idea is to represent as ordinal values i.e. [1] or [2] or [3]. But this implies that the class encoded by value 1 is closer to 2 than 1 is to 3. This is not desirable for categorical data. One-hot encoding avoids this by representing each species independently.<br>\n",
-" \"A\" — [1, 0, 0]<br>\n",
-" \"B\" — [0, 1, 0]<br>\n",
-" \"C\" — [0, 0, 1]</li>\n",
+" <li>One idea is to represent as categorical indices i.e. [1] or [2] or [3]. Will this work?</li>\n",
 " </ul>\n",
 " </li>\n",
 " </ul>\n",
@@ -219,9 +300,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Task 4 -- Part (a) and (b): Applying transforms to the data\n",
+"### Task 3 -- Part (a) and (b): Applying transforms to the data\n",
 "\n",
-"Modify the `PenguinDataset` class above so that the tuples of numbers are converted to PyTorch `torch.Tensor` s and the string targets are converted to one-hot vectors.\n",
+"Modify the `PenguinDataset` class above so that the tuples of numbers are converted to PyTorch `torch.Tensor`s and the string targets are converted to indices.\n",
 "\n",
 "- Begin by importing relevant PyTorch functions.\n",
 "- Complete `__len__()` and `__getitem__()` functions above.\n",
@@ -242,8 +323,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Apply the transforms we need to PenguinDataset class to convert input\n",
-"# data and target class to tensors. See Task 4 ``TODOs`` in PenguinDataset class.\n",
+"# Complete __len__() and __getitem__() functions\n",
+"# See Task 3 ``TODOs`` in PenguinDataset class.\n",
 "\n",
 "# Create train_set\n",
 "\n",
@@ -298,7 +379,7 @@
 "source": [
 "### Task 5: Creating ``DataLoaders``—and why\n",
 "\n",
-"Once we have created a ``Dataset`` object, we wrap it in a ``DataLoader``.\n",
+"Once we have created a ``Dataset`` object, we wrap it in a ``DataLoader``. This comes with a number of useful features:\n",
 "#### Mini-batches\n",
 "The ``DataLoader`` object allows us to put our inputs and targets in **mini-batches**, which makes for more efficient training.\n",
 "- Note: rather than supplying one input-target pair to the model at a time, we supply \"mini-batches\" of these data at once (typically a small power of 2, like 16 or 32).\n",
@@ -664,7 +745,7 @@
 " # run forward model and compute proxy probabilities over dimension 1 (columns of tensor).\n",
 "\n",
 " # compute loss\n",
-" # e.g. pred = [0.2, 0.7, 0.1] and target = [0, 1, 0]\n",
+" # e.g. pred : Tensor([3]) and target : int\n",
 "\n",
 " # compute gradients\n",
 "\n",
@@ -742,7 +823,58 @@
 "source": [
 "### Task 11: Visualise some results\n",
 "\n",
-"Let's do this part together—though feel free to make a start on your own if you have completed the previous exercises."
+"Let's do this part together—though feel free to make a start on your own if you have completed the previous exercises.\n",
+"\n",
+"<details>\n",
+"<summary>Visualising results</summary>\n",
+"\n",
+"```python\n",
+"import numpy as np\n",
+"import matplotlib.pyplot as plt\n",
+"\n",
+"quantities = [\"loss\", \"accuracy\"]\n",
+"splits = [\"train\", \"valid\"]\n",
+"\n",
+"epochs_range = np.arange(1, epochs + 1)\n",
+"\n",
+"fig, axes = plt.subplots(1, 2, figsize=(8, 4))\n",
+"\n",
+"for i, quant in enumerate(quantities):\n",
+"    ax = axes[i]\n",
+"    for split in splits:\n",
+"        values = metrics[f\"{quant}_{split}\"]\n",
+"        ax.plot(epochs_range, values, marker='o', markersize=2, label=split.capitalize())\n",
+"    ax.set_title(quant.capitalize())\n",
+"    ax.set_xlabel(\"Epoch\")\n",
+"    ax.set_ylabel(quant.capitalize())\n",
+"    ax.set_xlim(1, epochs)\n",
+"    ax.set_ylim(0.0, 1.0)\n",
+"    ax.legend()\n",
+"\n",
+"fig.tight_layout()\n",
+"plt.show()\n",
+"```\n",
+"\n",
+"</details>"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 2,
+"metadata": {},
+"outputs": [],
+"source": [
+"import matplotlib.pyplot as plt\n",
+"\n",
+"quantities = [\"loss\", \"accuracy\"]\n",
+"splits = [\"train\", \"valid\"]\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Task 12 -- Part (a): Confusion matrix"
 ]
 },
 {
@@ -751,34 +883,83 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
+"from torch import no_grad\n",
 "import matplotlib.pyplot as plt\n",
-"from numpy import linspace\n",
+"import numpy as np\n",
 "\n",
+"class_names = sorted(data.species.unique())\n",
 "\n",
-"quantities = [\"loss\", \"accuracy\"]\n",
-"splits = [\"train\", \"valid\"]\n",
+"all_preds = []\n",
+"all_labels = []\n",
 "\n",
-"fig, axes = plt.subplots(1, 2, figsize=(8, 4))\n",
+"model.eval()\n",
+"with no_grad():\n",
+"    for batch, label in valid_loader:\n",
+"        preds = model(batch).softmax(dim=1)\n",
+"        all_preds.append(preds.argmax(dim=1).numpy())\n",
+"        all_labels.append(label.numpy())\n",
+"\n",
+"# concatenate all predictions and labels\n",
+"all_preds = np.concatenate(all_preds)\n",
+"all_labels = np.concatenate(all_labels)\n",
+"\n",
+"cm = confusion_matrix(all_labels, all_preds, labels=[0, 1, 2])\n",
+"cm_normalized = cm.astype(\"float\") / (cm.sum(axis=1)[:, np.newaxis] + 1e-8)\n",
+"disp = ConfusionMatrixDisplay(\n",
+"    confusion_matrix=cm_normalized, display_labels=class_names\n",
+")\n",
+"\n",
+"# plotting\n",
+"fig, ax = plt.subplots(figsize=(6, 5))\n",
+"disp.plot(ax=ax, cmap=\"Blues\", colorbar=True, values_format=\".2f\")\n",
+"disp.ax_.set_xlabel(\"Predicted Label\")\n",
+"disp.ax_.set_ylabel(\"True Label\")\n",
+"plt.xticks(rotation=45)\n",
+"plt.grid(False)  # cleaner plot\n",
+"plt.title(\"Normalized Confusion Matrix\")\n",
+"plt.tight_layout()\n",
+"plt.show()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Task 12 -- Part (b): Classification report"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"from sklearn.metrics import classification_report\n",
+"import pandas as pd\n",
+"\n",
+"# class_names = ['Adelie', 'Chinstrap', 'Gentoo']\n",
+"report = classification_report(\n",
+"    y_true=all_labels,\n",
+"    y_pred=all_preds,\n",
+"    target_names=class_names,\n",
+"    output_dict=True  # <- so we can plot it\n",
+")\n",
 "\n",
-"for axis, quant in zip(axes.ravel(), quantities):\n",
-"    for split in splits:\n",
-"        key = f\"{quant}_{split}\"\n",
-"        axis.plot(\n",
-"            linspace(1, epochs, epochs),\n",
-"            metrics[key],\n",
-"            \"-o\",\n",
-"            ms=1.5,\n",
-"            label=split.capitalize(),\n",
-"        )\n",
-"    axis.set_ylabel(quant.capitalize(), fontsize=15)\n",
 "\n",
-"for axis in axes.ravel():\n",
-"    axis.legend(fontsize=15)\n",
-"    axis.set_ylim(bottom=0.0, top=1.0)\n",
-"    axis.set_xlim(left=1, right=epochs)\n",
-"    axis.set_xlabel(\"Epoch\", fontsize=15)\n",
+"# Convert the report dict to DataFrame for plotting\n",
+"report_df = pd.DataFrame(report).transpose()\n",
+"report_df = report_df.loc[class_names, ['precision', 'recall', 'f1-score']]\n",
 "\n",
-"fig.tight_layout()"
+"# Plot\n",
+"report_df.plot(kind='bar', figsize=(8, 5))\n",
+"plt.title(\"Per-Class Precision, Recall, and F1 Score\")\n",
+"plt.ylim(0.0, 1.05)\n",
+"plt.ylabel(\"Score\")\n",
+"plt.xticks(rotation=0)\n",
+"plt.grid(axis='y')\n",
+"plt.tight_layout()\n",
+"plt.show()"
 ]
 },
 {

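One detail of the confusion-matrix cell worth spelling out: the matrix is row-normalised, so each entry becomes the fraction of a true class assigned to each predicted class, and the `1e-8` guards against division by zero for a class with no samples. A small worked example with made-up counts:

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predictions.
cm = np.array([[10, 2, 0],
               [1, 8, 1],
               [0, 0, 12]])

# Divide each row by its total, as in the cell above.
cm_normalized = cm.astype("float") / (cm.sum(axis=1)[:, np.newaxis] + 1e-8)
print(np.round(cm_normalized[0], 3))  # [0.833 0.167 0.   ] for true class 0
```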
pyproject.toml

Lines changed: 2 additions & 0 deletions

@@ -30,6 +30,8 @@ dependencies = [
     "torch_tools @ git+https://github.com/jdenholm/TorchTools.git",
     "matplotlib",
     "numpy<2.0.0",
+    "umap-learn",
+    "seaborn"
 ]

 [project.optional-dependencies]
