
Commit 24412ca

Chapter 6, done!
1 parent 913195e commit 24412ca

3 files changed: +62 -67 lines changed

S65_driving_planning.ipynb

Lines changed: 24 additions & 26 deletions
@@ -5,7 +5,7 @@
 "id": "UcRF7OziizF1",
 "metadata": {},
 "source": [
-"# Planning for Autonomous Driving."
+"# Planning for Autonomous Driving"
 ]
 },
 {
@@ -23,7 +23,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "id": "tLBxSLGeWPV0",
 "metadata": {
 "tags": [
@@ -32,12 +32,12 @@
 },
 "outputs": [],
 "source": [
-"%pip install -q -U gtbook\n"
+"%pip install -q -U gtbook"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 87,
+"execution_count": null,
 "id": "ewrl5k4_akQV",
 "metadata": {
 "tags": [
@@ -46,7 +46,7 @@
 },
 "outputs": [],
 "source": [
-"# no imports (yet)\n"
+"# no imports (yet)"
 ]
 },
 {
@@ -114,15 +114,13 @@
 "id": "R0sQPSK681cf",
 "metadata": {},
 "source": [
-"```{index} motion primitives\n",
-"```\n",
 "## Motion Primitives\n",
 "\n",
 "Consider a car traveling in reverse that wishes to suddenly change its orientation\n",
 "by completing a rapid 180-degree turn (a favorite maneuver for drivers like James Bond and Steve McQueen).\n",
 "How would we go about implementing this type of maneuver in an autonomous vehicle?\n",
 "\n",
-"This two approaches we have considered before can be very inefficient for planning trajectories that have such well-defined\n",
+"The two approaches we have considered before can be very inefficient for planning trajectories that have such well-defined\n",
 "characteristics.\n",
 "For all of our probabilistic methods, we used a discrete time formulation and considered\n",
 "the effects of executing an action (e.g., move forward, move left) for a small duration of time, $\\Delta t$.\n",
@@ -132,6 +130,8 @@
 "In each case, the language of path segments is very simple, and in each case,\n",
 "a full plan will consist of many sequential steps.\n",
 "\n",
+"```{index} motion primitives\n",
+"```\n",
 "Instead, the U-turn maneuver could be achieved by a predefined\n",
 "sequence of steps: after achieving a reasonable speed, remove your foot from the gas pedal;\n",
 "turn left sharply and hit the brakes; at the perfect moment, release the brakes\n",
@@ -146,8 +146,7 @@
 "id": "53y_6iTD1Ptz",
 "metadata": {},
 "source": [
-"This idea is illustrated in the figure below, which shows four motion primitives\n",
-"for a car.\n",
+"This idea is illustrated in Figure [1](#fig:MotionPrimitives), which shows four motion primitives for a car.\n",
 "The primitive $P_1$ corresponds to driving forward, while motion primitives $P_2$, $P_3$, and $P_4$ correspond to veering\n",
 "to the left at increasingly sharp angles."
 ]
@@ -157,14 +156,13 @@
 "id": "y_GGvtQc94pI",
 "metadata": {},
 "source": [
-"```{index} polynomial trajectories, splines\n",
-"```\n",
-"\n",
 "<figure id=\"fig:MotionPrimitives\">\n",
 "<img src=\"https://github.com/gtbook/robotics/blob/main/Figures6/motion-primitives.png?raw=1\" style=\"width:18cm\" alt=\"\">\n",
 "<figcaption>Four motion primitives for a car veering to its left. </figcaption>\n",
 "</figure>\n",
 "\n",
+"```{index} polynomial trajectories, splines\n",
+"```\n",
 "Motion primitives can be defined in numerous ways.\n",
 "The figure above illustrates four fixed motion primitives, but it would not be difficult to generalize each of these\n",
 "to a class of motions by using parametric descriptions. \n",
@@ -208,7 +206,7 @@
 "For example, the traffic in rural Georgia is irrelevant when leaving downtown Atlanta on\n",
 "a trip to Boston.\n",
 "In this case, immediate driving decisions depend on the car just ahead, and the nearby\n",
-"cars in adjacent lanes.\n"
+"cars in adjacent lanes."
 ]
 },
 {
@@ -218,7 +216,7 @@
 "source": [
 "## Polynomial Trajectories\n",
 "\n",
-"Let’s begin with the simple problem of changing lanes along a straight stretch of highway. The situation is illustrated in the figure below.\n",
+"Let’s begin with the simple problem of changing lanes along a straight stretch of highway. The situation is illustrated in Figure [2](#fig:LaneChange).\n",
 "\n",
 "<figure id=\"fig:LaneChange\">\n",
 "<img src=\"https://github.com/gtbook/robotics/blob/main/Figures6/lane-change.png?raw=1\" style=\"width:18cm\" alt=\"\">\n",
@@ -239,7 +237,7 @@
 "\\end{equation}\n",
 "At the start of the maneuver, $s=0$, which matches the initial condition $d(0)=0$, and at $ s = s_\\mathrm{g}$\n",
 "we match the end condition $d(s_\\mathrm{g}) = d_\\mathrm{g}$.\n",
-"This trajectory is illustrated in the figure below.\n",
+"This trajectory is illustrated in Figure [3](#fig:LinearLaneChange).\n",
 "\n",
 "<figure id=\"fig:LinearLaneChange\">\n",
 "<img src=\"https://github.com/gtbook/robotics/blob/main/Figures6/linear-lane-change.png?raw=1\" style=\"width:18cm\" alt=\"\">\n",
@@ -305,7 +303,7 @@
 "\\end{aligned}\n",
 "\\end{equation}\n",
 "Note that these six equations are all linear in the parameters $\\alpha_i$, so it is a simple matter to solve\n",
-"these."
+"them."
 ]
 },
 {
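As an aside on the hunk above: the sentence it touches points out that the six boundary-condition equations are linear in the coefficients $\alpha_i$. Below is a minimal NumPy sketch of that linear solve; it is not part of the commit, and it assumes the usual quintic boundary conditions $d(0)=d'(0)=d''(0)=0$, $d(s_\mathrm{g})=d_\mathrm{g}$, $d'(s_\mathrm{g})=d''(s_\mathrm{g})=0$, since the notebook's actual six equations fall outside the visible diff context. The lane offset of 3.7 m is just an illustrative value.

```python
# Hedged sketch (not from the notebook): solve for the quintic lane-change
# coefficients alpha_0..alpha_5 of d(s) = sum_i alpha_i s^i, assuming the
# boundary conditions d(0)=0, d'(0)=0, d''(0)=0, d(s_g)=d_g, d'(s_g)=0,
# d''(s_g)=0 (an assumption; the notebook's equations are elided here).
import numpy as np

def quintic_lane_change(s_g: float, d_g: float) -> np.ndarray:
    """Return coefficients [alpha_0, ..., alpha_5] of the quintic d(s)."""
    A = np.zeros((6, 6))
    A[0, 0] = 1.0                                    # d(0)   = alpha_0
    A[1, 1] = 1.0                                    # d'(0)  = alpha_1
    A[2, 2] = 2.0                                    # d''(0) = 2 alpha_2
    A[3, :] = [s_g**i for i in range(6)]             # d(s_g)
    A[4, :] = [i * s_g**(i - 1) if i > 0 else 0.0 for i in range(6)]            # d'(s_g)
    A[5, :] = [i * (i - 1) * s_g**(i - 2) if i > 1 else 0.0 for i in range(6)]  # d''(s_g)
    b = np.array([0.0, 0.0, 0.0, d_g, 0.0, 0.0])
    return np.linalg.solve(A, b)                     # six linear equations, six unknowns

alpha = quintic_lane_change(s_g=30.0, d_g=3.7)       # 3.7 m: a typical lane width
print(alpha, np.polynomial.polynomial.polyval([0.0, 30.0], alpha))  # endpoints hit 0 and d_g
```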
@@ -320,13 +318,13 @@
 "While the derivation above produced a single polynomial trajectory,\n",
 "it is a simple matter to extend this formalism to construct trajectories\n",
 "that are composed of multiple consecutive polynomial segments.\n",
-"Such trajectores belong to the more general class of **splines**.\n",
+"Such trajectories belong to the more general class of **splines**.\n",
 "In general, a spline is a continuous, piecewise polynomial curve, and we are not\n",
 "necessarily given the specific values for the transition points between adjacent\n",
 "segments.\n",
 "\n",
-"In fact, we have actually done exactly this in the above derviation,\n",
-"if we consider that for $s < 0$ and for $s > s_\\mathrm{g}$ the trajectory $d(s)$ is linear and pararallel to the $s$-axis,\n",
+"In fact, we have actually done exactly this in the above derivation,\n",
+"if we consider that for $s < 0$ and for $s > s_\\mathrm{g}$ the trajectory $d(s)$ is linear and parallel to the $s$-axis,\n",
 "i.e., we have solved for a special case of three polynomial segments with two of those\n",
 "segments being linear and one quintic.\n",
 "\n",
@@ -408,11 +406,11 @@
 "trajectory, becomes an important problem. In this section, we address the problem\n",
 "of following such a trajectory.\n",
 "\n",
-"The figure below illustrates the situation.\n",
-"We denote by $\\gamma(s)$ the desired trjactory of the car, where $s$, an arc length\n",
+"Figure [4](#fig:FrenetFrame) illustrates the situation.\n",
+"We denote by $\\gamma(s)$ the desired trajectory of the car, where $s$, an arc length\n",
 "parameter, is a function of time, and therefore the instantaneous desired speed of \n",
 "the car is $\\dot{s}(t)$.\n",
-"Since the goal is to keep the car on the deisred trajectory, it is convenient\n",
+"Since the goal is to keep the car on the desired trajectory, it is convenient\n",
 "to represent the state of the car in a coordinate frame that is local to the trajectory.\n",
 "To do so, for each point along $\\gamma$, we define a frame with origin $\\gamma(s)$,\n",
 "with axes $t_\\gamma(s)$ and $n_\\gamma(s)$, the tangent and normal vectors\n",
@@ -480,15 +478,15 @@
 "want the maneuver to take too long), and the comfort of the human passenger.\n",
 "As mentioned above, humans are sensitive to acceleration changes in the lateral direction,\n",
 "therefore, we might wish to minimize the overall effect of such changes.\n",
-"Mathematically, the instantaneous change in lateral cceleration is given by the third\n",
+"Mathematically, the instantaneous change in lateral acceleration is given by the third\n",
 "derivative of $d$, which is known as the **jerk**.\n",
 "For a given trajectory $d(t)$, defined on the interval $[0,T]$, the following\n",
-"cost functional penalizes aggregate jerk and total exectution time\n",
+"cost functional penalizes aggregate jerk and total execution time\n",
 "\\begin{equation}\n",
 "J(d) = \\int_0^T \\left(\\frac{d}{dt}\\ddot{d}(\\tau)\\right)^2 d\\tau + \\beta T\n",
 "\\end{equation}\n",
 "In general, it may not be possible to solve this optimization problem in real time.\n",
-"In such cases, rather than using $J$ to solve find the optimal $d$, we\n",
+"In such cases, rather than using $J$ to find the optimal $d$, we\n",
 "can use a generate-and-test approach.\n",
 "With such an approach, several values of $T$ are proposed, and the corresponding quintic trajectories\n",
 "are computed for each of these. It is then a simple matter to evaluate the cost of each of\n",

S66_driving_DRL.ipynb

Lines changed: 29 additions & 23 deletions
@@ -23,7 +23,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "id": "pVeijfbAiYRG",
 "metadata": {
 "tags": [
@@ -40,7 +40,7 @@
 }
 ],
 "source": [
-"%pip install -q -U gtbook\n"
+"%pip install -q -U gtbook"
 ]
 },
 {
@@ -94,7 +94,6 @@
 "source": [
 "```{index} pair: deep reinforcement learning; DRL\n",
 "```\n",
-"\n",
 "Deep reinforcement learning (DRL) applies the power of deep learning to bring reinforcement learning to much more complex domains than what we were able to tackle with the Markov Decision Processes and RL concepts introduced in Chapter 3. The use of large, expressive neural networks has allowed researchers and practitioners alike to work with high bandwidth sensors such as video streams and LIDAR, and bring the promise of RL into real-world domains such as autonomous driving. This is still a field of active discovery and research, however, and we can give but a brief introduction here about what is a vast literature and problem space."
 ]
 },
@@ -139,11 +138,10 @@
 "id": "0xi2Y6T5YAtY",
 "metadata": {},
 "source": [
-"```{index} deep reinforcement learning; DQN\n",
-"```\n",
-"\n",
 "## Deep Q-Learning\n",
 "\n",
+"```{index} deep reinforcement learning; DQN\n",
+"```\n",
 "> DQN is an early deep learning RL method akin to Q-learning.\n",
 "\n",
 "Recall from Section 3.6 that we can define a policy in terms of **Q-values**, sometimes also called state-action values, and that we can define the optimal policy as \n",
@@ -166,14 +164,16 @@
 "id": "pCKixeLwsh2Z",
 "metadata": {},
 "source": [
-"```{index} execution phase, experience replay\n",
+"```{index} pair: deep Q-network; DQN\n",
 "```\n",
 "In the **deep Q-network** or DQN method we use a *supervised learning* approach to Q-learning. We train a neural network, parameterized by $\\theta$, to approximate the optimal Q-values:\n",
 "\\begin{equation}\n",
 "Q^*(x,a) \\approx \\hat{Q}(x,a; \\theta)\n",
 "\\end{equation}\n",
 "It might be worthwhile at this point to re-visit Section 5.6, where we introduced neural networks and how to train them using stochastic gradient descent (SGD).\n",
 "\n",
+"```{index} execution phase, experience replay\n",
+"```\n",
 "In the context of RL, the DQN method uses two additional ideas that are crucial in making the training converge to something sensible in difficult problems. The first is splitting the training into *execution* and *experience replay* phases:\n",
 "\n",
 "- during the **execution phase**, the policy is executed (possibly with some degree of randomness) and the experiences $(x,a,r,x')$, with $r$ the reward, are stored in a dataset $D$;\n",
@@ -189,6 +189,8 @@
 "\\mathcal{L}_{\\text{DQN}}(\\theta; D) \\doteq \\sum_{(x,a,r,x')\\in D} [\\text{target}(x,a,x') - Q(x,a; \\theta)]^2\n",
 "\\end{equation}\n",
 "\n",
+"```{index} off-policy RL\n",
+"```\n",
 "With this basic scheme, a team from DeepMind was able to achieve human or super-human performance on about 50 Atari 2600 games in 2015 {cite:p}`Mnih15nature_dqn`.\n",
 "DQN is a so-called **off-policy** method, in that each execution phase uses the best policy we computed so far, but we can still replay earlier experiences gathered with \"lesser\" policies. Nothing in the experience replay phase references the policy: every experience leads to a valid Q-value backup and a valid supervised learning signal."
 ]
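For readers skimming the diff, the DQN loss in the hunk above is just a sum of squared regression errors over replayed experiences. The toy NumPy sketch below, which is not part of the commit, computes it for a tabular Q "network"; it assumes the standard target $r + \gamma \max_{a'} \hat{Q}(x',a';\theta^-)$ with a frozen copy $\theta^-$, since the notebook's definition of target(x,a,x') lies outside the visible context.

```python
# Hedged sketch (not from the notebook): the DQN squared-error loss over a
# replay buffer D of experiences (x, a, r, x'), using a toy tabular Q-function.
# Assumes target(x, a, x') = r + gamma * max_a' Q(x', a'; theta_frozen).
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 5, 3, 0.99

theta = rng.normal(size=(num_states, num_actions))   # current Q(x, a; theta)
theta_frozen = theta.copy()                          # periodically-updated frozen copy

# A small replay buffer D of experiences (x, a, r, x').
D = [(rng.integers(num_states), rng.integers(num_actions),
      rng.normal(), rng.integers(num_states)) for _ in range(64)]

def dqn_loss(theta, theta_frozen, D):
    loss = 0.0
    for x, a, r, x_next in D:
        target = r + gamma * np.max(theta_frozen[x_next])   # bootstrap from frozen copy
        loss += (target - theta[x, a]) ** 2                 # supervised regression error
    return loss

print(dqn_loss(theta, theta_frozen, D))
```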
@@ -198,19 +200,22 @@
 "id": "D6PHabNMU4OO",
 "metadata": {},
 "source": [
-"```{index} stochastic policy, deep reinforcement learning; policy optimization\n",
-"```\n",
-"\n",
 "## Policy Optimization\n",
 "\n",
+"```{index} deep reinforcement learning; policy optimization\n",
+"```\n",
 "> Policy optimization takes a black box optimization approach to a deep policy.\n",
 "\n",
+"```{index} stochastic policy\n",
+"```\n",
 "Whereas the above gets at an optimal policy indirectly, via deep Q-learning, a different and very popular idea is to directly parameterize the policy using a neural network, with weights $\\theta$. It is common to make this a **stochastic policy**,\n",
 "\\begin{equation}\n",
 "\\pi(a|x; \\theta)\n",
 "\\end{equation}\n",
 "where $a \\in {\\cal A}$ is an action, $x \\in {\\cal X}$ is a state, and the policy outputs a *probability* for each action $a$ based on the state $x$. One of the reasons to prefer stochastic policies is that they are differentiable, as they output continuous values rather than discrete actions. This allows us to optimize for them via gradient descent, as we explore in the next section.\n",
 "\n",
+"```{index} cross-entropy\n",
+"```\n",
 "In Chapter 5 we used *supervised* learning to train neural networks, and we just applied this for learning Q-values in DQN. It is useful to consider how this might work for training a *policy*. Recall from Section 5.6 that we defined the empirical cross-entropy loss as\n",
 "\\begin{equation}\n",
 "\\mathcal{L}_{\\text{CE}}(\\theta; D) \\doteq - \\sum_{(x,y=c)\\in D} \\sum_c \\log p_c(x;\\theta)\n",
@@ -231,7 +236,9 @@
 "id": "So1rSw4zS-C5",
 "metadata": {},
 "source": [
-"In **policy optimization** we gather data by rolling out a set of trajectories $\\tau_i$. In supervised learning we have a dataset $D$ and labels $y_c$, but we have to proceed a bit differently in a reinforcement learning setting. In particular, for *on-policy* RL we gather data by executing our current best guess for the policy for some rollout length or horizon $H$, and we do this many different times, each time obtaining a *trajectory* $\\tau_i$.\n",
+"```{index} policy optimization, off-policy RL\n",
+"```\n",
+"In **policy optimization** we gather data by rolling out a set of trajectories $\\tau_i$. In supervised learning we have a dataset $D$ and labels $y_c$, but we have to proceed a bit differently in a reinforcement learning setting. In particular, for **on-policy** RL we gather data by executing our current best guess for the policy for some rollout length or horizon $H$, and we do this many different times, each time obtaining a *trajectory* $\\tau_i$.\n",
 "That still leaves the training signal: where does that come from? \n",
 "The key idea is to estimate how good a particular action is by estimating the state-action values $Q$ from the rollout rewards.\n",
 "In detail, we estimate the expected discounted reward starting at $x_{it}$, and taking action $a_{it}$, as\n",
@@ -257,9 +264,9 @@
 "- Initialize $\\theta$\n",
 "- Until convergence:\n",
 " 1. roll out a number of trajectories $\\tau_i$ using the current policy $\\pi(a;x,\\theta)$\n",
-" 2. try and change the parameters $\\theta$ as to decrease the surrogate loss function $\\mathcal{L}(\\theta)$\n",
+" 2. try to change the parameters $\\theta$ as to decrease the surrogate loss function $\\mathcal{L}(\\theta)$\n",
 " \n",
-"A simple, gradient-free approach for step 2 is simple hill-climbing aka stochastic search:\n",
+"A simple, gradient-free approach for step 2 is simple hill-climbing, aka stochastic search:\n",
 "\n",
 " - perturb $\\theta$ to $\\theta'$\n",
 " - set $\\theta \\leftarrow \\theta'$ *iff* $\\mathcal{L}(\\theta') < \\mathcal{L}(\\theta)$\n",
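The hill-climbing recipe in the hunk above translates almost line-for-line into code. Here is a minimal sketch, not from the notebook, in which `surrogate_loss` is a stand-in for the rollout-based estimate $\mathcal{L}(\theta)$ and the perturbation scale is an arbitrary choice.

```python
# Hedged sketch (not from the notebook): simple hill-climbing / stochastic
# search over policy parameters theta. surrogate_loss stands in for the
# rollout-based L(theta) from the text; here it is a dummy quadratic so
# the example runs on its own.
import numpy as np

rng = np.random.default_rng(42)

def surrogate_loss(theta: np.ndarray) -> float:
    # Placeholder for the rollout-based surrogate loss L(theta).
    return float(np.sum((theta - 1.0) ** 2))

theta = rng.normal(size=8)                                       # initialize theta
for _ in range(500):
    theta_prime = theta + 0.1 * rng.normal(size=theta.shape)     # perturb theta -> theta'
    if surrogate_loss(theta_prime) < surrogate_loss(theta):      # accept iff loss decreases
        theta = theta_prime

print(surrogate_loss(theta))
```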
@@ -273,18 +280,23 @@
 "id": "-sLUpvmQ2sNd",
 "metadata": {},
 "source": [
-"```{index} deep reinforcement learning; policy gradient methods\n",
-"```\n",
-"\n",
 "## Policy Gradient Methods\n",
 "\n",
+"```{index} deep reinforcement learning; policy gradient methods\n",
+"```\n",
 "> Policy gradient methods are akin to policy iteration, with a neural flavor.\n",
 "\n",
+"```{index} softmax function, logit\n",
+"```\n",
 "In a nutshell, policy gradient methods calculate the *gradient* of the surrogate loss $\\mathcal{L}(\\theta)$ defined above with respect to the policy parameters $\\theta$:\n",
 "\\begin{equation}\n",
 "\\nabla_\\theta \\mathcal{L}(\\theta) \\leftarrow - \\sum_i \\sum_{t=1}^H \\hat{Q}(x_{it},a_{it}) \\nabla_\\theta \\log \\pi(a_{it}|x_{it}, \\theta),\n",
 "\\end{equation}\n",
-"where $\\nabla_\\theta \\log \\pi(a_{it}|x_{it}, \\theta)$ is the gradient of the logarithm of the stochastic policy. This is easily obtained via back-propagation using any neural network framework of choice. In the case that actions are discrete, as in our example above, a stochastic policy network typically has a \"softmax\" function at the end. Then $\\nabla_\\theta \\log \\pi(a_{it}|x_{it}, \\theta)$ is the derivative of the \"logit\" layer right before the softmax function.\n",
+"where $\\nabla_\\theta \\log \\pi(a_{it}|x_{it}, \\theta)$ is the gradient of the logarithm of the stochastic policy. This is easily obtained via back-propagation using any neural network framework of choice. In the case that actions are discrete, as in our example above, a stochastic policy network typically ends with a *softmax* function. Recall that if the network outputs a vector of raw scores (the *logits*) $z \\in \\mathbb{R}^K$, the softmax is defined as\n",
+"\\begin{equation}\n",
+"\\mathrm{softmax}(z)_i = \\frac{e^{z_i}}{\\sum_{j=1}^K e^{z_j}}, \\quad i = 1,\\dots,K.\n",
+"\\end{equation}\n",
+"Thus, the logits are the raw outputs before applying the softmax, and $\\nabla_\\theta \\log \\pi(a_{it}|x_{it}, \\theta)$ is computed with respect to *these* values.\n",
 "We then use gradient descent to update the policy parameters:\n",
 "\\begin{equation}\n",
 "\\theta \\leftarrow \\theta - \\alpha \\nabla_\\theta \\mathcal{L}(\\theta)\n",
@@ -294,12 +306,6 @@
 "The algorithm above, using the estimated Q-values, is almost identical to the REINFORCE method {cite:p}`Williams92ml_reinforce`. That algorithm further improves on performance by not using the raw Q-values but rather the difference between the Q-values and some baseline policy. This has the effect of reducing the variance in the estimated Q-values due to using only a finite amount of data.\n",
 "The REINFORCE algorithm was introduced in 1992 and hence pre-dates the deep-learning revolution by about 20 years. It should also be said that in DRL, the neural networks that are used are typically not very deep. Several modern methods, such as \"proximal policy optimization\" (PPO) {cite:p}`Schulman17_PPO` apply a number of techniques to improve this basic method even further and make it more sample-efficient. PPO is now one of the most often-used DRL methods."
 ]
-},
-{
-"cell_type": "markdown",
-"id": "xtNoiDaqfViL",
-"metadata": {},
-"source": []
 }
 ],
 "metadata": {