Add RLHF guide and dummy demo with Keras/JAX #2117
Conversation
This commit introduces a new example for Reinforcement Learning from Human Feedback (RLHF). It includes:
- `examples/rl/rlhf_dummy_demo.py`: A Python script demonstrating a simple RLHF loop with a dummy environment, a policy model, and a reward model, using Keras with the JAX backend.
- `examples/rl/md/rlhf_dummy_demo.md`: A Markdown guide explaining the RLHF concept and the implementation details of the demo script.
- `examples/rl/README.md`: A new README for the RL examples section, now including the RLHF demo.

Note: The Python demo script (`rlhf_dummy_demo.py`) currently experiences timeout issues during the training loop in the development environment, even with significantly reduced computational load. This is documented in the guide and README. The code serves as a structural example of implementing the RLHF components.
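For orientation, the structural pieces the script wires together (a Keras policy model and reward model on the JAX backend) could be sketched roughly as follows; the layer sizes and definitions below are illustrative assumptions, not the actual contents of `rlhf_dummy_demo.py`.

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # the demo uses Keras with the JAX backend

import keras
from keras import layers

STATE_DIM = 4     # illustrative sizes, not taken from the actual script
NUM_ACTIONS = 2

# Policy model: maps a state to a probability distribution over actions.
policy_model = keras.Sequential([
    keras.Input(shape=(STATE_DIM,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(NUM_ACTIONS, activation="softmax"),
])

# Reward model: scores a (state, action) pair, standing in for human feedback.
reward_model = keras.Sequential([
    keras.Input(shape=(STATE_DIM + NUM_ACTIONS,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])
```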
Thanks Yasir! Just keep the .py file for now. Once it is approved, we can generate the .ipynb files.
examples/rl/rlhf_dummy_demo.py
Outdated
policy_model_params["non_trainable"], | ||
state_input | ||
) | ||
actual_predictions_tensor = predictions_tuple[0] |
Are we assuming batch size is 1?
Yes. The code as written assumes a batch size of 1 for all model inputs and gradient calculations.
Why is that?
This was a simplification for the demo's clarity and to manage complexity, especially since REINFORCE-style updates can be done on single trajectories. A more advanced setup would definitely use batching across multiple episodes or from a replay buffer for stability and efficiency.
Does that make sense in the context of this simplified demo?
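For contrast, a batched REINFORCE-style loss over a whole episode (roughly what the "more advanced setup" mentioned above would look like) could be written as in the sketch below; the function and argument names are hypothetical, not part of the demo.

```python
import jax.numpy as jnp

def batched_policy_loss(action_probs, actions, returns):
    """Hypothetical batched REINFORCE loss, for contrast with the batch-size-1 demo.

    action_probs: (batch, num_actions) softmax outputs from the policy model
    actions:      (batch,) integer indices of the sampled actions
    returns:      (batch,) per-step returns used to weight the log-probabilities
    """
    # Pick out the probability of the action actually taken at each step.
    chosen_probs = jnp.take_along_axis(action_probs, actions[:, None], axis=1)[:, 0]
    log_probs = jnp.log(chosen_probs + 1e-7)
    return -jnp.mean(log_probs * returns)
```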
examples/rl/rlhf_dummy_demo.py
Outdated
episode_policy_losses.append(current_policy_loss)
policy_grads_step = policy_grads_dict_step["trainable"]
# Accumulate policy gradients
for i, grad in enumerate(policy_grads_step):
For potentially better performance:
policy_grads_accum = jax.tree_map(lambda acc, new: acc + new if new is not None else acc, policy_grads_accum, policy_grads_step)
Thanks, I have refactored the policy gradient and reward gradient accumulations using jax.tree_map.
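For readers following along, the accumulation pattern amounts to something like the sketch below; the helper name is illustrative, and the actual script may simply inline the `jax.tree_map` call.

```python
import jax

def accumulate_grads(grads_accum, grads_step):
    """Add one step's gradients into a running accumulator, leaf by leaf.

    Both arguments are pytrees with identical structure (e.g. the
    "trainable" gradient dicts produced by the demo's loss functions).
    """
    return jax.tree_map(lambda acc, new: acc + new, grads_accum, grads_step)
```

At the end of an episode, the same `jax.tree_map` pattern can scale the accumulated pytree by the number of steps before the optimizer applies it.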
examples/rl/md/rlhf_dummy_demo.md
Outdated
actual_predictions_tensor = predictions_tuple[0]
action_probs = actual_predictions_tensor[0]
log_prob = jnp.log(action_probs[action] + 1e-7)
return -log_prob * predicted_reward_value_stopped
If this predicted_reward_value is just R(s,a), then it's using the immediate predicted reward, which is a very naive and generally ineffective way to train a policy. The log_prob should instead be multiplied by the cumulative discounted future reward (the return, G_t).
Commit c212593 addresses this.
…rewards.

This commit refines the RLHF demo example (`examples/rl/rlhf_dummy_demo.py`) to use discounted cumulative actual rewards (G_t) for policy gradient calculations, aligning it with the REINFORCE algorithm. Changes include:
- Added a `calculate_discounted_returns` helper function.
- Modified the `rlhf_training_loop` to collect trajectories (states, actions, rewards) and compute G_t for each step at the end of an episode.
- Updated the policy loss function to use these G_t values instead of immediate predicted rewards.
- The reward model training logic remains focused on predicting immediate rewards based on simulated human feedback (environment reward in this demo).
- Updated the corresponding RLHF guide (`examples/rl/md/rlhf_dummy_demo.md`) to explain these changes and provide updated code snippets.

The timeout issues with the script in the development environment persist, but the code now better reflects a standard policy gradient approach.
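A typical implementation of such a discounted-returns helper looks roughly like the sketch below; the committed version may differ in details (for example, whether the returns are normalized).

```python
import numpy as np

def calculate_discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    # Walk the episode backwards so each step reuses the already-discounted tail.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```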
The .md files are automatically generated, so you might want to move the explanation content to the .py file.
Moving the explanation content from .md to .py
I have deleted the .md files and added the relevant documentation to the .py file.