Sections 5.5 and 5.6
dellaert committed Feb 10, 2025
1 parent 17faa90 commit 1ec3550
Showing 2 changed files with 103 additions and 103 deletions.
120 changes: 59 additions & 61 deletions S55_diffdrive_planning.ipynb

Large diffs are not rendered by default.

86 changes: 44 additions & 42 deletions S56_diffdrive_learning.ipynb
@@ -23,7 +23,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "-z7-iMHZamMh",
"metadata": {
"tags": [
@@ -40,7 +40,7 @@
}
],
"source": [
"%pip install -q -U gtbook\n"
"%pip install -q -U gtbook"
]
},
{
@@ -114,12 +114,12 @@
"id": "eVunmcqzSi4j",
"metadata": {},
"source": [
"```{index} supervised learning, classification, regression\n",
"```\n",
"## Supervised Learning Setup\n",
"\n",
"> From data, learn concept.\n",
"\n",
"```{index} supervised learning, classification, regression, training dataset\n",
"```\n",
"In the **supervised learning** setup, we have a large number of examples of inputs $x$ and corresponding labels $y$.\n",
"We will often refer to the *training dataset* as $D$, consisting of pairs $(x,y)$. The nature of the output labels $y$ determine the type of learning problem we are dealing with:\n",
"\n",
@@ -133,6 +133,8 @@
"id": "PiBqLmehLzBj",
"metadata": {},
"source": [
"```{index} training datasets, validation dataset, test dataset, overfitting\n",
"```\n",
"Whether we are talking about classification or regression, the supervised leaning process normally follows these steps:\n",
"\n",
"1. Define a model $f$ and its parameters $\\theta$ that allow you to output a prediction $\\hat{y}$ from the input features $x$:\n",
@@ -143,9 +145,9 @@
"\n",
"2. Train the model using the training data $D_{\\text{train}}$, while monitoring for \"overfitting\" on the validation dataset $D_{\\text{val}}$. We train by adjusting the parameters $\\theta$ to minimize a training loss, both of which we look at in more detail below.\n",
"\n",
"3. After we decided to stop the training process, we typically test the model on the held-out dataset $D_{\\text{test}}$ that the training process has never seen, to get an independent assessment on how well the model will generalize towards new, unseen data.\n",
"3. After we decide to stop the training process, we typically test the model on the held-out dataset $D_{\\text{test}}$ that the training process has never seen, to get an independent assessment of how well the model will generalize towards new, unseen data.\n",
"\n",
"Supervised learning is the staple of machine learning and its use has exploded in recent years to encompass almost any human economic activity, ranging from finance to healthcare and everything in between. Most recently the success of large language models is also based on supervised learning, where a \"transformer\"-based model is trained to predict the next word (or token) in a sequence, from very large textual datasets, a paradigm which is rapidly finding its way to different modalities like vision as well."
"Supervised learning is the staple of machine learning and its use has exploded in recent years to encompass almost any human economic activity, ranging from finance to healthcare and everything in between. Most recently the success of large language models is also based on supervised learning, where a *transformer*-based model is trained to predict the next word (or token) in a sequence, from very large textual datasets, a paradigm which is rapidly finding its way to different modalities like vision as well."
]
},
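To make steps 1-3 concrete, here is a minimal sketch (not part of the notebook) of the three-way data split using PyTorch's `random_split`; the toy sine data and the 70/15/15 proportions are arbitrary illustrative choices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 1000 (x, y) pairs; the target function here is an arbitrary choice.
x = torch.linspace(0, 1, 1000).unsqueeze(1)
y = torch.sin(2 * torch.pi * x) + 0.1 * torch.randn_like(x)
dataset = TensorDataset(x, y)

# Split into D_train, D_val, and D_test; 70/15/15 is an arbitrary illustrative split.
D_train, D_val, D_test = random_split(dataset, [700, 150, 150])
print(len(D_train), len(D_val), len(D_test))  # 700 150 150
```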
{
@@ -155,7 +157,7 @@
"source": [
"## Example: Interpolation in 1D\n",
"\n",
"As an example, e formulate a simple regression problem that asks for interpolating functions in 1D. We will create a *differentiable* interpolation scheme that can be trained using samples from any function we want to interpolate, even functions with multi-dimensional outputs.\n",
"As an example, we formulate a simple regression problem that asks for interpolating functions in 1D. We will create a *differentiable* interpolation scheme that can be trained using samples from any function we want to interpolate, even functions with multi-dimensional outputs.\n",
"\n",
"The `LineGrid` class below is designed for this purpose, and divides up the 1D interval over which the function is defined in a number of *cells*, arranged in a 1D grid. It is initialized with two parameters:\n",
"\n",
@@ -251,12 +253,12 @@
"id": "E5lXyKlfEG3I",
"metadata": {},
"source": [
"```{index} mean squared error\n",
"```\n",
"## Loss Functions\n",
"\n",
"> A loss function for every occasion.\n",
"\n",
"```{index} mean squared error, loss function\n",
"```\n",
"Different tasks require different loss functions, and a lot of creativity and research goes into crafting loss functions for complex tasks. For \"vanilla\" regression tasks, we typically use a **mean squared error** loss function as we already encountered before:\n",
"\\begin{equation}\n",
"\\mathcal{L}_{\\text{MSE}}(\\theta; D) \\doteq \\frac{1}{|D|} \\sum_{(x,y)\\in D}|f(x;\\theta)-y|^2\n",
@@ -288,7 +290,7 @@
"id": "zaJW_r0PV78Z",
"metadata": {},
"source": [
"We used the vectorized versions of subtraction and power above, and then used the `mean` method of tensors. As you can see, the MSE loss in this case is 14.8715. Even though in this case the calculation is simple, many other loss functions exists and might not be that straightforward to implement. Luckily, PyTorch has many loss functions built-in:"
"We used the vectorized versions of subtraction and power above, and then used the `mean` method of tensors. As you can see, the MSE loss in this case is 13.045638. Even though in this case the calculation is simple, many other loss functions exist and might not be as straightforward to implement. Luckily, PyTorch has many loss functions built-in:"
]
},
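As a hedged illustration (with made-up numbers, not the notebook's data), the hand-coded mean of squared differences and PyTorch's built-in `nn.MSELoss` give the same result:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([1.0, 2.0, 3.0])   # made-up predictions
y = torch.tensor([1.5, 1.0, 4.0])       # made-up labels

mse_manual = ((y_hat - y) ** 2).mean()  # vectorized subtraction, power, then mean
mse_builtin = nn.MSELoss()(y_hat, y)    # PyTorch's built-in MSE loss

assert torch.isclose(mse_manual, mse_builtin)
print(mse_manual.item())  # 0.75
```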
{
@@ -318,12 +320,14 @@
"source": [
"```{index} cross entropy\n",
"```\n",
"For classification, the **cross entropy** loss function is very popular: it measures the average disagreement of the predicted labels with the ground truth labels:\n",
"For classification, the **cross entropy** loss function is very popular.\n",
"It measures the average disagreement of the predicted labels with the ground truth labels:\n",
"\\begin{equation}\n",
"\\mathcal{L}_{\\text{CE}}(\\theta; D) \\doteq \\sum_c \\sum_{(x,y=c)\\in D}\\log\\frac{1}{p_c(x;\\theta)}\n",
"\\end{equation}\n",
"\n",
"This formula seems perhaps unintuitive and rather complicated. However, it is actually quite intuitive once you understand a few concepts.\n",
"This formula seems perhaps unintuitive and rather complicated;\n",
"however, it is actually quite intuitive once you understand a few concepts.\n",
"In particular, in the multi-class classification problem we assume that the model outputs a probability $p_c(x;\\theta)$ for every class $c\\in[N]$, where $N$ is the number of classes. The quantity \n",
"\\begin{equation}\n",
"\\log\\frac{1}{p_c(x;\\theta)}\n",
@@ -333,23 +337,24 @@
"However, if the probability is only $0.01$, our surprise is $\\log\\frac{1}{0.01}=\\log 100 = 2$.\n",
"The lower the probability, the higher the surprise. Hence, the cross-entropy above measures the *average surprise* for seeing the labeled examples in the training data. After training, the model is the least surprised possible, hopefully, which is why it is an intuitive loss function to minimize.\n",
"\n",
"Note that training with cross-entropy does not guarantee that the outputs can be *truly* interpreted as probabilities: the recent field of \"model calibration\" has shown that especially neural networks can severely over-estimate those probability values in attempting to minimize the loss. If this interpretation is important for the application at hand, several techniques now exist to \"calibrate\" the models to be more interpretable that way."
"```{index} model calibration\n",
"```\n",
"Note that training with cross-entropy does not guarantee that the outputs can be *truly* interpreted as probabilities: the recent field of *model calibration* has shown that especially neural networks can severely over-estimate those probability values in attempting to minimize the loss. If this interpretation is important for the application at hand, several techniques now exist to *calibrate* the models to be more interpretable that way."
]
},
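The sketch below (made-up logits and labels, not from the notebook) shows the surprise computation next to PyTorch's `nn.CrossEntropyLoss`, which takes raw scores, applies a softmax internally, and averages the negative log-probabilities of the true classes; it uses natural logarithms, whereas the base-10 example in the text differs only by a constant factor.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0],   # made-up raw scores for 3 classes
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])              # ground-truth class indices

# "Surprise" per example: -log p_c(x), with probabilities from a softmax.
probs = torch.softmax(logits, dim=1)
surprise = -torch.log(probs[torch.arange(len(labels)), labels])
print(surprise.mean())

# The built-in loss computes the same average surprise directly from the logits.
print(nn.CrossEntropyLoss()(logits, labels))
```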
{
"cell_type": "markdown",
"id": "pb2oEJG4Z8Dt",
"metadata": {},
"source": [
"```{index} gradient descent\n",
"```\n",
"## Gradient Descent\n",
"\n",
"> Calculate gradient, reduce loss.\n",
"\n",
"A neural network output, and in particular a CNN, depends on the large set of continuous weights $W$ that make up its convolutional layers, pooling layers, and fully connected layers. In other words, the neural network is the model $f(x;\\theta)$ in the learning setup discussed above, and the weights $W$ are its parameters $\\theta$.\n",
"\n",
"\n",
"```{index} gradient descent\n",
"```\n",
"When we train a neural networks, we adjust its weights $W$ to perform better on the task at hand, be it classification or regression. To measure whether the model performs \"better\", we can use one of the loss functions defined above. To adjust the weights, we could calculate the gradient of the loss function with respect to each of the weights, and adjust the weights accordingly. That procedure is called **gradient descent**."
]
},
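Here is a minimal sketch of gradient descent on a single weight, with a made-up quadratic loss rather than an actual network, just to show the mechanics of computing a gradient with autograd and stepping against it:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a single "weight"
learning_rate = 0.1

for _ in range(5):
    loss = (w - 1.0) ** 2                   # made-up loss, minimized at w = 1
    loss.backward()                         # compute d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad         # step against the gradient
    w.grad.zero_()                          # reset the gradient for the next step
    print(f"w = {w.item():.3f}, loss = {loss.item():.3f}")
```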
@@ -414,18 +419,20 @@
"id": "jyI361D_6bRA",
"metadata": {},
"source": [
"We can then use the PyTorch training code below, which is a standard way of training any differentiable function, including our LineGrid class. That is because all the operations inside the LineGrid class are differentiable, so gradient descent will just work.\n",
"We can then use the PyTorch training code below, which is a standard way of training any differentiable function, including our `LineGrid` class. That is because all the operations inside the `LineGrid` class are differentiable, so gradient descent will just work.\n",
"\n",
"Inside the training loop below, you'll find the typical sequence of operations: zeroing gradients, performing a forward pass to get predictions, computing the loss, and doing a backward pass to update the model's parameters. Try to understand the code, as this same training loop is at the core of most deep learning architectures. Now, let's take a closer look at the code itself, which is extensively documented for clarity:"
"Inside the training loop below, you'll find the typical sequence of operations: zeroing gradients, performing a forward pass to get predictions, computing the loss, and doing a backward pass to update the model's parameters. Try to understand the code, as this same training loop is at the core of most deep learning architectures. Now, let's take a closer look at the code itself, which is extensively documented for clarity, and listed in Figure [2](#train_gd)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"id": "pFZvb4Mz458C",
"metadata": {},
"outputs": [],
"source": [
"#| caption: Code to train a model using gradient descent.\n",
"#| label: code:train_gd\n",
"def train_gd(model, dataset, loss_fn, callback=None, learning_rate=0.5, num_iterations=301):\n",
" # Initialize optimizer\n",
" optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n",
@@ -1982,30 +1989,17 @@
"id": "J1c0z_y-2s6E",
"metadata": {},
"source": [
"Note that gradient descent converges rather slow. You could try experimenting with the learning rate to speed this up. \n",
"The resulting loss function is shown in Figure [2](#fig:loss_training).\n",
"Note that gradient descent converges rather slowly.\n",
"You could try experimenting with the learning rate to speed this up. \n",
"\n",
"After the training has converged, we can evaluate the resulting functions and plot the result against the training data, and we see that we get decent approximations of sin and cos, even with noisy training data:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "lMSjElNmAaMt",
"metadata": {},
"outputs": [],
"source": [
"x_sorted = torch.sort(x_samples).values\n",
"y_pred = model(x_sorted).detach().numpy()\n",
"fig = plotly.graph_objects.Figure()\n",
"fig.add_scatter(x=x_samples, y=y_samples[:, 0], mode='markers', name='sin')\n",
"fig.add_scatter(x=x_samples, y=y_samples[:, 1], mode='markers', name='cos')\n",
"fig.add_scatter(x=x_sorted, y=y_pred[:, 0], mode='lines', name='predicted sin')\n",
"fig.add_scatter(x=x_sorted, y=y_pred[:, 1], mode='lines', name='predicted cos');\n"
"After the training has converged, we can evaluate the resulting functions and plot the result against the training data,\n",
"and Figure [3](#fig:sin_cos_approx) that we get decent approximations of sin and cos, even with noisy training data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"id": "bo8SalWnHCFO",
"metadata": {},
"outputs": [
@@ -4880,18 +4874,25 @@
"source": [
"#| caption: Learned approximation of the sine and cosine functions. The model has learned to fit the data.\n",
"#| label: fig:sin_cos_approx\n",
"fig.show()\n"
"x_sorted = torch.sort(x_samples).values\n",
"y_pred = model(x_sorted).detach().numpy()\n",
"fig = plotly.graph_objects.Figure()\n",
"fig.add_scatter(x=x_samples, y=y_samples[:, 0], mode='markers', name='sin')\n",
"fig.add_scatter(x=x_samples, y=y_samples[:, 1], mode='markers', name='cos')\n",
"fig.add_scatter(x=x_sorted, y=y_pred[:, 0], mode='lines', name='predicted sin')\n",
"fig.add_scatter(x=x_sorted, y=y_pred[:, 1], mode='lines', name='predicted cos');\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "Cc3kfkWGei-x",
"metadata": {},
"source": [
"```{index} pair: stochastic gradient descent; SGD\n",
"```\n",
"## Stochastic Gradient Descent\n",
"\n",
"```{index} pair: stochastic gradient descent; SGD\n",
"```\n",
"**Stochastic gradient descent** or **SGD** is an approximate gradient descent procedure, to cope with the very large data sets typically thrown at supervised problems. It is typically impossible to calculate the *exact* gradient, which requires looping over all the examples, which can run in the millions. An easy approximation scheme is to *randomly sample* a small subset of the examples, and calculate the gradient of the weights using only those examples. The upside is that this is much faster, but the downside is that this is only approximate. Hence, if we adjust weights with this approximate gradient, we might or might not make progress on the task. This procedure is called stochastic gradient descent, and it works amazingly well in practice.\n",
"\n",
"The `DataLoader` class in PyTorch makes implementing SGD very easy: it can wrap any `Dataset` instance, and then retrieves training samples one \"mini-batch\" at a time. The code below uses a mini-batch size of 25, but feel free to experiment with different values for both this parameter and the learning rate to get a feel for what happens. Note that by convention we refer to one execution of the inner loop below, over a mini-batch, as an \"iteration\". One full cycle through the dataset by randomly selecting mini-batches is referred to as an \"epoch\"."
@@ -5905,7 +5906,8 @@
"id": "E61thb3Ae0LD",
"metadata": {},
"source": [
"Note that we converged *much* faster in this case: in just 30 iterations we reached the same low loss as with 300 iterations before. The answer is because with 250 training samples and mini-batches of size 25, each epoch adjusts the model's parameters 10 times. This effectively boosts the learning rate by a factor of 10. However, note that because each mini-batch looks at only one 10th of the dataset, each mini-batch's adjustment could *adversely* affect the performance on the other training samples."
"The training loss is shown in Figure [4](#fig:loss_training_sgd).\n",
"Note that we converge *much* faster in this case: in just 30 iterations we reached the same low loss as with 300 iterations before. The answer is because with 250 training samples and mini-batches of size 25, each epoch adjusts the model's parameters 10 times. This effectively boosts the learning rate by a factor of 10. However, note that because each mini-batch looks at only one 10th of the dataset, each mini-batch's adjustment could *adversely* affect the performance on the other training samples."
]
},
{
