book/rl.html: 27 additions & 0 deletions
@@ -299,6 +299,33 @@ <h1>Using a Drake simulation as a Gym environment</h1>
<li> Implement the advantage function. </li>
</ol>
</exercise>
<exercise id="rl-box-flipup"><h1>Analyzing Box Flipping with RL</h1>
In this exercise, you will analyze the behavior of a <a href="https://arxiv.org/abs/1707.06347">PPO</a> policy trained to flip over a box. Like REINFORCE, PPO is a policy-gradient method that
directly optimizes the policy parameters to maximize the expected return. In order to have an easier problem to analyze,
we'll use the <a href="https://manipulation.csail.mit.edu/force.html#force_flip_up">box flipup example</a> from Chapter 8. Our robot will be a simple point finger and
the goal will be to flip over the box. You can find the code used to train the policy <a href="https://github.com/RussTedrake/manipulation/blob/master/book/rl/train_boxflipup.py">here</a>.
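For context, a training script along these lines might look roughly like the following minimal sketch, which assumes <a href="https://stable-baselines3.readthedocs.io/">stable_baselines3</a> and a gym-registered version of the environment; the environment id and hyperparameters are illustrative placeholders rather than the settings used in the linked script.
<pre><code class="language-python">
# Hypothetical sketch only -- not the actual train_boxflipup.py.
# Assumes the box-flipup simulation has been registered as a gym environment.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("BoxFlipUp-v0")             # hypothetical environment id
model = PPO("MlpPolicy", env, verbose=1)   # actor-critic MLP policy
model.learn(total_timesteps=3_000_000)     # 3M steps, as in part (c) below
model.save("ppo_boxflipup")
</code></pre>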
<oltype="a">
<li> Take a look at the <a href="https://github.com/RussTedrake/manipulation/blob/master/manipulation/envs/box_flipup.py">code</a> used to generate the environment.
Let $\theta$ denote the angle of the box from the vertical, $\omega$ denote the angular velocity of the box, $q_f$ denote the observed position of the finger,
$v_f$ denote the velocity of the finger, and $u_f$ denote the commanded position of the finger. What is the reward function used here to train the policy?
Write it down mathematically (use the modulo operator to handle the wrap-around of the angle). What do the individual terms in the reward
function represent? Why do they make sense?</li>
<li> Although we will not go into the exact details of how PPO works here, it works quite similarly to REINFORCE, but uses (i) a learned value function to reduce variance, and (ii) an approximate
(surrogate) objective together with a trust-region-style constraint, implemented by clipping the per-sample loss so that the policy is not updated too much at each step (see the sketch after this list). Briefly explain
(a) why you think PPO might be more stable and sample-efficient than REINFORCE, and (b) how you would expect PPO to perform on the box flipping task if the clipping limits are set too small or too large.</li>
<li> We've trained a PPO-based policy to flip the box for 3,000,000 steps (see
<a href="https://youtube.com/playlist?list=PLOZK7fx6sI6lvDalINByA_kbYfBxn76by&si=XUndO5UDHMJcywr2">here</a> for videos of the
policy in action at a series of checkpoints during training). How does the policy perform as the number of training steps increases? Describe qualitatively
how the policy changes over time and which parts of the reward function are having the greatest effect at each stage.
</li>
</ol>
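As a reference for part (b), the clipped surrogate objective from the <a href="https://arxiv.org/abs/1707.06347">PPO paper</a> is $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ and $A_t$ is the advantage estimate computed with the learned value function. The snippet below is an illustrative sketch of that per-sample clipping, not code taken from the training script:
<pre><code class="language-python">
# Illustrative sketch of the clipped PPO surrogate loss (not the training script's code).
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    # r_t(theta): probability ratio between the new and old policies.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the surrogate objective = minimizing its negative.
    return -torch.min(unclipped, clipped).mean()
</code></pre>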
Notice how much time it takes to train a working policy, even for a simple manipulation problem like the 2D box flipping example with a point finger and a dense reward.
Harder problems in manipulation (such as pick and place) can become extremely challenging to train naïvely with reinforcement learning, especially with
sparse rewards, where you only receive a reward once the object has been picked or placed in the right location. On the other hand,
reinforcement learning can work well in contact-rich settings (as in the box flipping example); see <a href="https://www.youtube.com/watch?v=x4O8pojMF0w">RL solving a Rubik's Cube with one hand</a>
for an example of RL being used to solve a contact-rich manipulation task (note that this also depended heavily on domain randomization, curriculum learning, large-scale compute, etc.).
The story in locomotion, by contrast, seems to be quite different, perhaps because it is easier to design dense rewards and to automate resets in simulation.
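To make the sparse-versus-dense distinction concrete, here is an illustrative (made-up) pair of reward functions for a pick and place style task; the specific terms and tolerance are assumptions for exposition, not taken from any particular environment:
<pre><code class="language-python">
# Illustrative only: contrasting sparse and dense rewards for a pick-and-place style task.
import numpy as np

def sparse_reward(object_pos, goal_pos, tol=0.02):
    # Reward appears only once the object is within tol of the goal;
    # almost all rollouts early in training receive zero signal.
    at_goal = np.less(np.linalg.norm(object_pos - goal_pos), tol)
    return float(at_goal)

def dense_reward(object_pos, goal_pos, gripper_pos):
    # Hand-shaped reward: always provides a "better/worse" signal,
    # at the cost of having to design the shaping terms yourself.
    reach = np.linalg.norm(gripper_pos - object_pos)
    place = np.linalg.norm(object_pos - goal_pos)
    return -(reach + place)
</code></pre>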