book/rl.html: 27 additions & 0 deletions
@@ -299,6 +299,33 @@ <h1>Using a Drake simulation as a Gym environment</h1>
<li> Implement the advantage function. </li>
</ol>
</exercise>
<exercise id="rl-box-flipup"><h1>Analyzing Box Flipping with RL</h1>
In this exercise, you will analyze the behavior of a <a href="https://arxiv.org/abs/1707.06347">PPO</a> policy trained to flip over a box. Like REINFORCE, PPO is a policy-gradient method that
directly optimizes the policy parameters to maximize the expected return. In order to have an easier problem to analyze,
we'll use the <a href="https://manipulation.csail.mit.edu/force.html#force_flip_up">box flipup example</a> from Chapter 8. Our robot will be a simple point finger and
the goal will be to flip over the box. You can find the code used to train the policy <a href="https://github.com/RussTedrake/manipulation/blob/master/book/rl/train_boxflipup.py">here</a>.
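For context, a training script along these lines might look roughly like the following minimal sketch, which assumes <a href="https://stable-baselines3.readthedocs.io/">stable_baselines3</a> and a gym-registered version of the environment; the environment id and hyperparameters are illustrative placeholders rather than the settings used in the linked script.
<pre><code class="language-python">
# Hypothetical sketch only -- not the actual train_boxflipup.py.
# Assumes the box-flipup simulation has been registered as a gym environment.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("BoxFlipUp-v0")             # hypothetical environment id
model = PPO("MlpPolicy", env, verbose=1)   # actor-critic MLP policy
model.learn(total_timesteps=3_000_000)     # 3M steps, as in part (c) below
model.save("ppo_boxflipup")
</code></pre>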
<oltype="a">
<li> Take a look at the <a href="https://github.com/RussTedrake/manipulation/blob/master/manipulation/envs/box_flipup.py">code</a> used to generate the environment.
Let $\theta$ denote the angle of the box from the vertical, $\omega$ denote the angular velocity of the box, $q_f$ denote the observed position of the finger,
$v_f$ denote the velocity of the finger, and $u_f$ denote the commanded position of the finger. What is the reward function used here to train the policy?
Write it down mathematically (use the modulo operator to handle the wrap-around of the angle). What do the individual terms in the reward
function represent? Why do they make sense?</li>
<li> Although we will not go into the exact details of how PPO works here, it works quite similarly to REINFORCE, but uses (i) a learned value function to reduce variance, and (ii) an approximate
(surrogate) objective together with a trust-region-style constraint, implemented by clipping the per-sample loss so that the policy is not updated too much at each step (see the sketch after this list). Briefly explain
(a) why you think PPO might be more stable and sample-efficient than REINFORCE, and (b) how you would expect PPO to perform on the box flipping task if the clipping limits are set too small or too large.</li>
<li> We've trained a PPO-based policy to flip the box for 3,000,000 steps (see
<a href="https://youtube.com/playlist?list=PLOZK7fx6sI6lvDalINByA_kbYfBxn76by&si=XUndO5UDHMJcywr2">here</a> for videos of the
policy in action at a series of checkpoints during training). How does the policy perform as the number of training steps increases? Describe qualitatively
how the policy changes over time and which parts of the reward function are having the greatest effect at each stage.
</li>
</ol>
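As a reference for part (b), the clipped surrogate objective from the <a href="https://arxiv.org/abs/1707.06347">PPO paper</a> is $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ and $A_t$ is the advantage estimate computed with the learned value function. The snippet below is an illustrative sketch of that per-sample clipping, not code taken from the training script:
<pre><code class="language-python">
# Illustrative sketch of the clipped PPO surrogate loss (not the training script's code).
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    # r_t(theta): probability ratio between the new and old policies.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the surrogate objective = minimizing its negative.
    return -torch.min(unclipped, clipped).mean()
</code></pre>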
Notice how much time it takes to train a working policy, even for a simple manipulation problem like the 2D box flipping example with a point finger and a dense reward.
Harder problems in manipulation (such as pick and place) can become extremely challenging to train naïvely with reinforcement learning, especially with
sparse rewards, where you only receive a reward once the object has been picked or placed in the right location. On the other hand,
reinforcement learning can work well in contact-rich settings (as in the box flipping example); see <a href="https://www.youtube.com/watch?v=x4O8pojMF0w">RL solving a Rubik's Cube with one hand</a>
for an example of RL being used to solve a contact-rich manipulation task (note that this also depended heavily on domain randomization, curriculum learning, large-scale compute, etc.).
The story in locomotion, by contrast, seems to be quite different, perhaps because it is easier to design dense rewards and to automate resets in simulation.
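To make the sparse-versus-dense distinction concrete, here is an illustrative (made-up) pair of reward functions for a pick and place style task; the specific terms and tolerance are assumptions for exposition, not taken from any particular environment:
<pre><code class="language-python">
# Illustrative only: contrasting sparse and dense rewards for a pick-and-place style task.
import numpy as np

def sparse_reward(object_pos, goal_pos, tol=0.02):
    # Reward appears only once the object is within tol of the goal;
    # almost all rollouts early in training receive zero signal.
    at_goal = np.less(np.linalg.norm(object_pos - goal_pos), tol)
    return float(at_goal)

def dense_reward(object_pos, goal_pos, gripper_pos):
    # Hand-shaped reward: always provides a "better/worse" signal,
    # at the cost of having to design the shaping terms yourself.
    reach = np.linalg.norm(gripper_pos - object_pos)
    place = np.linalg.norm(object_pos - goal_pos)
    return -(reach + place)
</code></pre>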