
Commit 5f57fab

Merge branch 'master' into pset8
2 parents 1cee711 + bb10ec8

File tree

12 files changed: +572 −20 lines


.github/workflows/main.yml

Lines changed: 9 additions & 1 deletion
@@ -51,14 +51,22 @@ jobs:
   check-approval:
     name: Check PR approval status
     runs-on: ubuntu-latest
-    if: github.event_name == 'pull_request_target'
+    if: github.event_name == 'pull_request_target' || github.event_name == 'push' || github.event_name == 'schedule'
     outputs:
       should-run: ${{ steps.check.outputs.should-run }}
       is-trusted: ${{ steps.check.outputs.is-trusted }}
     steps:
       - name: Check approval status
         id: check
         run: |
+          # For push events (commits to master) and scheduled runs, always run
+          if [[ "${{ github.event_name }}" != "pull_request_target" ]]; then
+            echo "is-trusted=true" >> $GITHUB_OUTPUT
+            echo "should-run=true" >> $GITHUB_OUTPUT
+            echo "✅ Push or scheduled event - running tests automatically"
+            exit 0
+          fi
+
           # Check if this is a trusted contributor
           if [[ "${{ github.event.pull_request.author_association }}" == "COLLABORATOR" ]] || \
              [[ "${{ github.event.pull_request.author_association }}" == "MEMBER" ]] || \

book/chapters.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

book/figures/sam/raw_img.png

280 KB

book/rl.html

Lines changed: 27 additions & 0 deletions
@@ -299,6 +299,33 @@ <h1>Using a Drake simulation as an Gym environment</h1>
     <li> Implement the advantage function. </li>
   </ol>
 </exercise>
+<exercise id="rl-box-flipup"><h1>Analyzing Box Flipping with RL</h1>
+  In this exercise, you will analyze the behavior of a <a href="https://arxiv.org/abs/1707.06347">PPO</a> policy trained to flip over a box. Like REINFORCE, PPO is a policy-gradient method that
+  directly optimizes the policy parameters to maximize the value function. In order to have an easier problem to analyze,
+  we'll use the <a href="https://manipulation.csail.mit.edu/force.html#force_flip_up">box flipup example</a> from Chapter 8. Our robot will be a simple point finger and
+  the goal will be to flip over the box. You can find the code used to train the policy <a href="https://github.com/RussTedrake/manipulation/blob/master/book/rl/train_boxflipup.py">here</a>.
+  <ol type="a">
+    <li> Take a look at the <a href="https://github.com/RussTedrake/manipulation/blob/master/manipulation/envs/box_flipup.py">code</a> used to generate the environment.
+    Let $\theta$ denote the angle of the box from the vertical, $\omega$ denote the angular velocity of the box, $q_f$ denote the observed position of the finger,
+    $v_f$ denote the velocity of the finger, and $u_f$ denote the commanded position of the finger. What is the reward function used here to train the policy?
+    Write it down mathematically (use the modulo operator to handle the wrap-around of the angle). What do the individual terms in the reward
+    function represent? Why do they make sense?</li>
+    <li> Although we will not go into the exact details of how PPO works here, it works quite similarly to REINFORCE but using both (i) a learned value function to reduce variance, and (ii) an approximate
+    objective, along with a trust-region constraint by clipping the per-sample loss to ensure that the policy is not updated too much at each step. Briefly explain why you think that
+    (a) PPO might be more stable and sample efficient than REINFORCE, and (b) how you might expect PPO to perform on the box flipping task if the clipping limits are set to be too small or too large.</li>
+    <li> We've trained a PPO-based policy to flip the box for 3,000,000 steps (see
+    <a href="https://youtube.com/playlist?list=PLOZK7fx6sI6lvDalINByA_kbYfBxn76by&si=XUndO5UDHMJcywr2">here</a> for videos of the
+    policy in action at each of the timesteps). How does the policy perform as the number of steps increases? Write qualitatively
+    how the policy changes over time and which parts of the reward function are having the greatest effect at each step.
+    </li>
+  </ol>
+  Notice how much time it takes to train a working policy, even for a simple manipulation problem like the 2D box flipping example with a point finger and a dense reward.
+  Harder problems in manipulation (such as pick and place) can become extremely challenging to train naïvely with Reinforcement Learning, especially with
+  sparse rewards such as in typical pick and place tasks where you only receive a reward when the object has been picked or placed in the right location. On the other hand,
+  reinforcement learning can work well in contact-rich settings (as in the box flipping example); see <a href="https://www.youtube.com/watch?v=x4O8pojMF0w">RL solving a rubik's cube with one hand</a>
+  for an example of RL being used to solve a contact-rich manipulation task (note this also depended heavily on things like domain randomization, curriculum learning, large scale compute, etc.).
+  The story in locomotion, on the other hand, seems to be quite different, perhaps because it is easier to design dense rewards and to automate resets in simulation.
+</exercise>
 </section>

 </chapter>
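
Two hedged sketches may help when working through the new exercise. The first illustrates the clipped per-sample loss referenced in part (b): it is the standard PPO clipped surrogate objective from the Schulman et al. paper linked above, written as a hypothetical PyTorch helper; it is not code from the manipulation repository, and the function name and arguments are illustrative.

import torch

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Negative PPO clipped surrogate objective (minimize with SGD/Adam)."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    # clip_eps is the clipping limit that part (b) asks you to reason about.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum keeps each update conservative: once the ratio
    # leaves [1 - clip_eps, 1 + clip_eps], pushing it further cannot improve
    # the objective for that sample.
    return -torch.min(unclipped, clipped).mean()

The second sketches the shape of a training run like the 3,000,000-step one in part (c), using Stable-Baselines3 PPO on a Gym-registered environment. The environment id "BoxFlipUp-v0" and the save path are assumptions for illustration; see the linked train_boxflipup.py for the actual setup.

import gymnasium as gym
from stable_baselines3 import PPO

# "BoxFlipUp-v0" is a placeholder id; the Drake-backed environment must be
# registered with Gym before gym.make() can construct it.
env = gym.make("BoxFlipUp-v0")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=3_000_000)  # the step count cited in the exercise
model.save("ppo_boxflipup")             # hypothetical output path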
