
Make some edits
dellaert committed Dec 23, 2024
1 parent 93f0fd8 commit 62e0b55
Showing 1 changed file with 14 additions and 9 deletions.
23 changes: 14 additions & 9 deletions S66_driving_DRL.ipynb
@@ -85,9 +85,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"```{index} lateral control, longitudinal control\n",
"```{index} lateral control, longitudinal control, lane switching\n",
"```\n",
"A simple example in the autonomous driving domain is *lane switching*. Suppose we are driving along at 3-lane highway, and we can see some ways ahead, and some ways behind us. We are driving at a speed that is comfortable to us, but other cars have different ideas about the optimal speed to drive at. Hence, sometimes we would like to change lanes, and we could learn a policy to do this for us. As discussed in Section 6.5, this is **lateral control**. A more sophisticated example would also allow us to adapt our speed to the traffic pattern, but by relying on a smart cruise control system we could safely ignore the **longitudinal control** problem."
"A simple example in the autonomous driving domain is *lane switching*. Suppose we are driving along at 3-lane highway, and we can see some ways ahead, and -using the rear-view mirror- some ways behind us. We are driving at a speed that is comfortable to us, but other cars have different ideas about their optimal driving speed. Hence, sometimes we would like to change lanes, and we could learn a policy to do this for us. As discussed in Section 6.5, this is **lateral control**. A more sophisticated example would also allow us to adapt our speed to the traffic pattern, but by relying on a smart cruise control system we could safely ignore the **longitudinal control** problem."
]
},
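To make the lane-switching example concrete, here is a minimal sketch of how such a lateral-control problem might be encoded: three discrete actions and an observation of the gaps around the ego vehicle. The class names, fields, threshold, and the hand-coded stand-in policy below are illustrative assumptions, not code from this notebook.

```python
from dataclasses import dataclass
from enum import IntEnum


class LaneAction(IntEnum):
    """Discrete lateral-control actions for the lane-switching example."""
    MOVE_LEFT = 0
    STAY = 1
    MOVE_RIGHT = 2


@dataclass
class LaneObservation:
    """What the policy sees: the ego lane plus gaps ahead/behind in each of the 3 lanes."""
    ego_lane: int                           # 0, 1, or 2 on a 3-lane highway
    gap_ahead: tuple[float, float, float]   # distance (m) to the nearest car ahead, per lane
    gap_behind: tuple[float, float, float]  # distance (m) to the nearest car behind, per lane


def example_policy(obs: LaneObservation) -> LaneAction:
    """A trivial hand-coded stand-in for a learned policy: move toward the lane with the
    largest gap ahead, but only if there is enough room behind (the 10 m margin is arbitrary)."""
    best = max(range(3), key=lambda lane: obs.gap_ahead[lane])
    if best == obs.ego_lane or obs.gap_behind[best] < 10.0:
        return LaneAction.STAY
    return LaneAction.MOVE_LEFT if best < obs.ego_lane else LaneAction.MOVE_RIGHT


obs = LaneObservation(ego_lane=1, gap_ahead=(15.0, 40.0, 80.0), gap_behind=(30.0, 25.0, 12.0))
print(example_policy(obs))  # LaneAction.MOVE_RIGHT: lane 2 is most open ahead and clear behind
```

A learned policy would replace `example_policy`, mapping the same kind of observation to a lane-change decision, while longitudinal control is delegated to the cruise control as described above.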
{
@@ -121,11 +121,15 @@
"\\begin{equation}\n",
"\\pi^*(x) = \\arg \\max_a Q^*(x,a)\n",
"\\end{equation}\n",
"where $Q^*(x,a)$ denote the Q-values for the *optimal* policy. In Q-learning, we start with some random Q-values and then iteratively improve the estimate for the optimal Q-values by alpha-blending between old and new estimates:\n",
"where $Q^*(x,a)$ denote the Q-values for the *optimal* policy. In Q-learning, we start with some random Q-values and then iteratively improve an estimate $\\hat{Q}(x,a)$ for the optimal Q-values by alpha-blending between old and new estimates:\n",
"\\begin{equation}\n",
"\\hat{Q}(x,a) \\leftarrow (1-\\alpha) \\hat{Q}(x,a) + \\alpha~\\text{target}(x,a,x')\n",
"\\hat{Q}(x,a) \\leftarrow (1-\\alpha) \\hat{Q}(x,a) + \\alpha~\\text{target}(x,a,x').\n",
"\\end{equation}\n",
"where $\\text{target}(x,a,x') \\doteq R(x,a,x') + \\gamma \\max_{a'} \\hat{Q}(x',a')$ is the \"target\" value that we think is an improvement on the previous value $\\hat{Q}(x,a)$. Indeed: the target $\\text{target}(x,a,x')$ uses the current estimate of the Q-values for future states, but improves on this by using the *known* reward $R(x,a,x')$ for the current action in the current state."
"Above, the \"target value\"\n",
"\\begin{equation}\n",
"\\text{target}(x,a,x') \\doteq R(x,a,x') + \\gamma \\max_{a'} \\hat{Q}(x',a')\n",
"\\end{equation}\n",
"is a value that we think is an improvement on the previous value $\\hat{Q}(x,a)$. Indeed: $\\text{target}(x,a,x')$ uses the *current* estimate of the Q-values for future states, but improves on this by using the *known* rewards $R(x,a,x')$ for the current action $a$ in the current state $x$."
]
},
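The alpha-blended update above translates almost directly into code. Here is a minimal tabular sketch of a single Q-learning backup, with the Q-table stored in a dictionary and initialized to zero (rather than randomly) for brevity; the toy states, actions, and reward are made up for illustration.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.95      # learning rate and discount factor
Q = defaultdict(float)        # Q[(x, a)] -> current estimate, zero until updated


def q_backup(x, a, r, x_next, actions):
    """One Q-learning backup: alpha-blend the old estimate with the target value."""
    target = r + gamma * max(Q[(x_next, a_next)] for a_next in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target


# One experience (x, a, r, x') on a toy problem with three actions.
q_backup(x="lane_1", a=2, r=-1.0, x_next="lane_2", actions=[0, 1, 2])
print(Q[("lane_1", 2)])       # -0.1: all future estimates are still zero, so target = -1.0
```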
{
@@ -134,11 +138,13 @@
"source": [
"```{index} execution phase, experience replay\n",
"```\n",
"In the **deep Q-network** or DQN method we use a *supervised learning* approach to Q-learning, by training a neural network, parameterized by $\\theta$, to approximate the optimal Q-values:\n",
"In the **deep Q-network** or DQN method we use a *supervised learning* approach to Q-learning. We train a neural network, parameterized by $\\theta$, to approximate the optimal Q-values:\n",
"\\begin{equation}\n",
"Q^*(x,a) \\approx Q(x,a; \\theta)\n",
"Q^*(x,a) \\approx \\hat{Q}(x,a; \\theta)\n",
"\\end{equation}\n",
"It might be worthwhile to re-visit Section 5.6, where we introduced neural networks and how to train them using stochastic gradient descent (SGD). In the context of RL, the DQN method uses two additional ideas that are crucial in making the training converge to something sensible in difficult problems. The first is splitting the training into *execution* and *experience replay* phases:\n",
"It might be worthwhile at this point to re-visit Section 5.6, where we introduced neural networks and how to train them using stochastic gradient descent (SGD).\n",
"\n",
"In the context of RL, the DQN method uses two additional ideas that are crucial in making the training converge to something sensible in difficult problems. The first is splitting the training into *execution* and *experience replay* phases:\n",
"\n",
"- during the **execution phase**, the policy is executed (possibly with some degree of randomness) and the experiences $(x,a,r,x')$, with $r$ the reward, are stored in a dataset $D$;\n",
"- during **experience replay**, we *randomly sample* from these experiences to create mini-batches of data, which are in turn used to perform SGD on the parameters $\\theta$.\n",
@@ -154,7 +160,6 @@
"\\end{equation}\n",
"\n",
"With this basic scheme, a team from DeepMind was able to achieve human or super-human performance on about 50 Atari 2600 games in 2015 {cite:p}`Mnih15nature_dqn`.\n",
"\n",
"DQN is a so-called **off-policy** method, in that each execution phase uses the best policy we computed so far, but we can still replay earlier experiences gathered with \"lesser\" policies. Nothing in the experience replay phase references the policy: every experience leads to a valid Q-value backup and a valid supervised learning signal."
]
},
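To connect the execution and experience replay phases to actual training code, here is a compact sketch of both, assuming a small PyTorch Q-network over a 7-dimensional observation with 3 discrete actions, an epsilon-greedy execution policy, and a hypothetical environment whose `step(a)` returns `(x', r, done)`. It covers only the first of the two ideas (the split into the two phases), and none of it is code from the notebook itself.

```python
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 3))  # Q(x, .; theta)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1
replay_buffer = deque(maxlen=10_000)  # the dataset D of experiences (x, a, r, x')


def execution_step(env, x):
    """Execution phase: act epsilon-greedily and store the experience in D."""
    if random.random() < epsilon:
        a = random.randrange(3)
    else:
        with torch.no_grad():
            a = int(q_net(torch.as_tensor(x, dtype=torch.float32)).argmax())
    x_next, r, done = env.step(a)  # assumed environment interface
    replay_buffer.append((x, a, r, x_next))
    return x_next, done


def replay_step(batch_size=32):
    """Experience replay phase: sample a random mini-batch and take one SGD step on theta."""
    batch = random.sample(replay_buffer, batch_size)
    xs, acts, rs, xs_next = zip(*batch)
    x = torch.as_tensor(xs, dtype=torch.float32)
    a = torch.as_tensor(acts, dtype=torch.int64)
    r = torch.as_tensor(rs, dtype=torch.float32)
    x_next = torch.as_tensor(xs_next, dtype=torch.float32)
    with torch.no_grad():  # the target uses the current Q estimates for future states
        target = r + gamma * q_net(x_next).max(dim=1).values
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(x, a; theta) for taken actions
    loss = nn.functional.mse_loss(q_xa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because every stored tuple $(x,a,r,x')$ yields a valid target regardless of which policy generated it, `replay_step` never needs to know how the data was collected, which is exactly the off-policy property noted above.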
