
Conversation

@Haichao-Zhang (Contributor) commented May 7, 2025

Post Process Experience
This PR adds the preprocess_unroll_experience feature, which is useful in a number of scenarios. For example:

  1. many kinds of trajectory labeling (e.g. hindsight success relabeling)
  2. trajectory filtering (e.g. excluding some trajectories from being stored into the buffer).

In some cases, it is also a necessary component when we want to use in-algorithm procedures to overwrite quantities in the timestep before recording it into the buffer (e.g. the step type), so that all the other logic in ALF is fully respected (e.g. masking out the loss for the LAST step based on the step type).
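As a rough, hypothetical sketch (not code from this PR's diff), a per-step relabeling hook could look like the following. The method name and the convention that it returns the list of experiences to store follow the snippets quoted later in this thread; the StepTypeRelabeler class, the _should_mark_last predicate, and the assumption that exp is a namedtuple-style experience exposing step_type are illustrative only.

```python
import torch

from alf.data_structures import StepType


class StepTypeRelabeler:
    """Hypothetical mixin; in the PR the hook lives on the RL algorithm class."""

    def preprocess_unroll_experience(self, rollout_info, step_type, exp):
        # Overwrite step_type before the experience is stored, so that ALF's
        # existing logic (e.g. masking the loss of LAST steps) sees the
        # relabeled value rather than the one produced by the environment.
        if self._should_mark_last(rollout_info):  # hypothetical predicate
            exp = exp._replace(
                step_type=torch.full_like(step_type, int(StepType.LAST)))
        # The hook returns the list of experiences to store for this step;
        # returning [] would drop the step entirely.
        return [exp]

    def _should_mark_last(self, rollout_info):
        # Placeholder; a real subclass would inspect rollout_info here.
        return False
```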

Synced Training
Note that although one may think step-based filtering (e.g. excluding tasks) could also be done on the replay buffer side, the training dynamics are not the same.
This PR ensures synced training, meaning we do not run a train step for the invalid/excluded steps.
In contrast, replay-buffer-based filtering cannot ensure synced training.

Customizable Modes
The behavior can be customized by the user. Some examples (a sketch follows this list):
(1) per-step saving without delay: save each step of unroll experience into the replay buffer as soon as we get it.
(2) all-step saving with delay: save all the steps of unroll experience into the replay buffer with a delay. This is useful when we want to annotate a trajectory based on quantities that are not immediately available at the current step (e.g. task success/failure).
(3) selective saving: exclude some of the unroll experiences and only save the rest. This is useful when some transitions are irrelevant to training (e.g. in the multi-task case, where we want to exclude data from certain subtasks).
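A similarly hypothetical sketch of modes (2) and (3) is below (mode (1) is simply returning [exp] immediately, as in the sketch above). The self._cached_exp list mirrors the one that appears in the diff snippets later in the thread; the predicates and the annotation step are placeholders, and for simplicity the sketch assumes a single (non-batched) environment.

```python
import torch

from alf.data_structures import StepType


class EpisodeEndAnnotator:
    """Hypothetical mixin; in the PR the hook lives on the RL algorithm class."""

    def __init__(self):
        super().__init__()
        # Per-step cache, flushed when the episode ends.
        self._cached_exp = []

    def preprocess_unroll_experience(self, rollout_info, step_type, exp):
        # Mode (3): selective saving -- drop transitions we never train on.
        if self._is_irrelevant_subtask(rollout_info):  # hypothetical predicate
            return []
        # Mode (2): all-step saving with delay -- cache until the episode ends,
        # then annotate the whole trajectory with end-of-episode information.
        # For simplicity, this assumes a single environment (batch_size 1).
        self._cached_exp.append(exp)
        if not bool((step_type == int(StepType.LAST)).any()):
            return []  # episode still in progress; nothing stored yet
        annotated = [self._annotate(e) for e in self._cached_exp]
        self._cached_exp = []
        return annotated

    def _is_irrelevant_subtask(self, rollout_info):
        return False  # placeholder

    def _annotate(self, exp):
        return exp  # placeholder: e.g. write task success/failure into exp
```

The length of the returned list is what drives the effective_unroll_steps bookkeeping discussed in the review comments below.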

@common.mark_replay
def train_from_replay_buffer(self, update_global_counter=False):
def train_from_replay_buffer(self,
effective_unroll_steps,
Contributor:

Add docstring for this arg?

Contributor Author:

This arg is now removed

config, self._num_earliest_frames_ignored)

if self._episodic_annotation:
    assert self._env.batch_size == 1, "only support non-batched environment"
Contributor:

Add this to the docstring of episodic_annotation?

Contributor Author:

The assertion is not necessary here, so it is removed as well.

Comment on lines 609 to 611
"""A function that determines whether the ``post_process_episode`` function should
be applied to the current list of experiences.
"""
Contributor @runjerry commented May 9, 2025:

This is an interface mainly used for subclasses? Maybe mention this. Same for post_process_episode.

Contributor Author:

Good point. Added comments. Also for post_process_episode

self._cached_exp)
effective_number_of_unroll_steps = len(annotated_exp_list)
# 2) observe
if not self.on_policy:
Contributor:

Maybe this condition check should be performed earlier, since it seems a waste to do all the post_process_episode if self.on_policy?

Contributor Author:

Updated

Comment on lines 1485 to 1486
< config.initial_collect_steps) or (effective_unroll_steps
== 0):
Contributor:

Is there any situation that train_from_replay_buffer will be called with effective_unroll_steps=0?

Contributor Author:

effective_unroll_steps is now removed from this function

Comment on lines 905 to 910
for i in range(effective_unroll_steps):
    steps += self.train_from_replay_buffer(effective_unroll_steps=1,
                                            update_global_counter=True)
if unrolled:
    with record_time("time/after_train_iter"):
        self.after_train_iter(root_inputs, rollout_info)
Contributor @runjerry commented May 9, 2025:

I feel that this update fundamentally changes the off-policy update logic w.r.t. its actual unroll in the env. Previously, between every call of self._unroll_iter_off_policy, the policy gets an "update" from self.train_from_replay_buffer. Now if self._episodic_annotation, policy training only happens after each episode, though the UTD stays the same. I feel that the episodic annotation function should be configurable independently of the choice of such unroll/update logic. Ideally, we may want to keep the previous version here while achieving the same effect of the change of above lines by configuring unroll_length and num_updates_per_train_iter.

Contributor Author:

If self._episodic_annotation is False, everything is the same as before.
If self._episodic_annotation is True, by default (with the new commit) it also reduces to the original logic, so everything is still the same as before (policy training happens after each time step, not only after each episode).

In the derived class, it is up to the user to determine what kind of annotation function they want to implement and use.

alf.layers.to_float32(policy_state))
effective_number_of_unroll_steps = 1
if self._episodic_annotation:
    assert not self.on_policy, "only support episodic annotation for off policy training"
Contributor:

Maybe assert this in the __init__ function?

@runjerry (Contributor) commented May 10, 2025

Thank you Haichao for addressing all my comments. Just one more minor question.

@le-horizon (Contributor) left a comment:

Some high- and low-level comments, if they make sense.

self.observe_for_replay(exp)
store_exp_time = time.time() - t0
# clean up the exp cache
self._cached_exp = []
Contributor:

This seems to assume that all envs end on the same step? What if some envs are LAST, some are MID? cached_exp will be cleared even for those with MID steps?

Even when doing this for an env with batch_size 1, this annotation mode will delay experience from being stored into the replay buffer.

Ok to submit the change as is, but may need to do two things:

  1. rename the feature to something like store_experience_on_episode_end, and document its behavior clearly in the docstr.
    experience relabel should be done when reading data out of replay buffer as in hindsight relabel.

  2. assert that batch_size is 1 when enabled.

Also, delaying train_step because of delayed experience storage can have unexpected side effects, e.g. if episodes are 100 steps long, and unroll once per train iter, then summary will only happen every 100 train iters. It will also shift the distribution of the data training sees due to the delay.

Overall I think doing this episode level relabeling at the DataTransformer stage, after reading from replay_buffer is perhaps a better way, and a cleaner way as well (less scattered code). That would require the replay buffer to keep track of episode begin and end, which I think it already does.

Contributor Author @Haichao-Zhang commented May 23, 2025:

> This seems to assume that all envs end on the same step? What if some envs are LAST, some are MID? cached_exp will be cleared even for those with MID steps?

There is no such assumption; it is entirely up to the users to inject their own assumptions. By default, the behavior is the same as before.
Sorry that the function names are a bit misleading; their role has been extended to handle the per-step case as well. Changed the function names and added more comments.

> Even when doing this for an env with batch_size 1, this annotation mode will delay experience from being stored into the replay buffer.

No, it won't. By default, the behavior is the same as before.

> Ok to submit the change as is, but may need to do two things:
>
>   1. rename the feature to something like store_experience_on_episode_end, and document its behavior clearly in the docstr.

The suggested name is not appropriate.

> experience relabel should be done when reading data out of replay buffer as in hindsight relabel.

Different use cases. This is an alternative interface that can support more than pure relabeling (e.g. excluding data), which is not directly supported by the replay buffer hindsight relabel.

>   2. assert that batch_size is 1 when enabled.

There is no such assumption in the current PR. It is up to the user.

> Also, delaying train_step because of delayed experience storage can have unexpected side effects, e.g. if episodes are 100 steps long, and unroll once per train iter, then summary will only happen every 100 train iters. It will also shift the distribution of the data training sees due to the delay.

There is no delay.

> Overall I think doing this episode level relabeling at the DataTransformer stage, after reading from replay_buffer is perhaps a better way, and a cleaner way as well (less scattered code). That would require the replay buffer to keep track of episode begin and end, which I think it already does.

As explained, it is more than pure relabeling.

with record_time("time/after_train_iter"):
    self.after_train_iter(root_inputs, rollout_info)
steps = 0
for i in range(effective_unroll_steps):
Contributor:

unroll_steps is the wrong name? It should be called unroll_iterations to indicate training iterations, not env steps?

Also rename effective_number_of_unroll_steps to effective_unroll_iters to be consistent (i.e. remove "number_of_").

Contributor Author:

Thanks for the comments. Changed.

experience, batch_info = self._replay_buffer.gather_all(
ignore_earliest_frames=True)
num_updates = config.num_updates_per_train_iter
num_updates = effective_num_updates_per_train_iter
Contributor:

why do you need to make this change?

Contributor Author @Haichao-Zhang commented May 23, 2025:

not necessary anymore. removed

effective_unroll_iters = effective_unroll_steps // unroll_length
return experience, effective_unroll_iters

def should_post_process_experience(self, rollout_info,
Contributor:

This is unnecessary. We can always call post_process_experience

Contributor Author:

Correct. Removed this function.

As another example, task filtering can be simply achieved by returning ``[]``
in ``post_process_experience`` for that particular task.
- per-episode processing: ``should_post_process_experience`` returns True on episode
end and ``post_process_experience`` can return a list of cached and processed
Contributor:

no need to mention "cached". It will confuse the user.

Contributor Author:

Updated the docstring.

@Haichao-Zhang force-pushed the PR_episodic_annotation branch from 45c8321 to 9cfe6a5 on May 23, 2025 19:12
@Haichao-Zhang force-pushed the PR_episodic_annotation branch from d89cb3e to 26ab09a on May 23, 2025 19:33
@Haichao-Zhang changed the title from "Episodic annotation and synced training" to "Post Process Experience" on May 23, 2025
@Haichao-Zhang changed the title from "Post Process Experience" to "Post Process Experience and Synced Training" on May 23, 2025
with record_time("time/after_train_iter"):
    self.after_train_iter(root_inputs, rollout_info)
steps = 0
for i in range(effective_unroll_iters):
Contributor:

It is possible that effective_unroll_iters is always smaller than 1 when num_envs > 1.

Contributor Author @Haichao-Zhang commented May 23, 2025:

Good point. Now also handles the fractional unroll case.

# 1) process
post_processed_exp_list = self.post_process_experience(
rollout_info, time_step.step_type, exp)
effective_unroll_steps = len(post_processed_exp_list)
Contributor:

In order to really get this right, we need to do: sum(exp.step_type.shape[0] for exp in post_processed_exp_list) / exp.step_type.shape[0]

Contributor Author:

Updated.
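To make the suggested counting concrete, here is a tiny self-contained example with made-up numbers; Exp is just a stand-in for ALF's experience structure.

```python
from collections import namedtuple

import torch

# Minimal stand-in for ALF's experience; only step_type is needed here.
Exp = namedtuple("Exp", ["step_type"])

batch_size = 4  # hypothetical number of parallel environments
# Suppose the hook kept 2 of the unrolled steps, each carrying a batch of 4.
post_processed_exp_list = [
    Exp(step_type=torch.zeros(batch_size, dtype=torch.int32)) for _ in range(2)
]

# Count environment steps actually kept, normalized by the batch size,
# following the formula suggested in the review comment above.
effective_unroll_steps = (
    sum(e.step_type.shape[0] for e in post_processed_exp_list) / batch_size)
print(effective_unroll_steps)  # 2.0
```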

@Haichao-Zhang (Contributor Author) commented:

Closing for now, since this is mostly used in a side project. Will reopen if this becomes a general feature for other use cases.

@Haichao-Zhang (Contributor Author) commented:

Reopening, as it seems other people are also trying to use this feature.

@Haichao-Zhang reopened this on Jun 5, 2025
@Haichao-Zhang (Contributor Author) commented:

Pushed a commit 94a50bf to give the user the flexibility to customize effective_unroll_steps for achieving different effects.

@Haichao-Zhang requested a review from hnyu on June 5, 2025 18:04
effective_unroll_iters = 1 if unroll_length == 0 else effective_unroll_steps // unroll_length
return experience, effective_unroll_iters

def post_process_experience(self, rollout_info, step_type: StepType,
Collaborator:

This name is confusing given the existing function preprocess_experience: "post" might suggest that this happens after that, but in fact this happens before training.

Contributor Author:

changed to preprocess_unroll_experience

store_exp_time = 0.
step_time = 0.
max_step_time = 0.
effective_unroll_steps = 0
Collaborator:

I think we lack a formal definition of "effective" in the code document.

Contributor Author:

Added more comments with examples, especially in preprocess_unroll_experience

return experience
# if the input unroll_length is 0 (e.g. fractional unroll), then it is treated as
# an effective unroll iter
effective_unroll_iters = 1 if unroll_length == 0 else effective_unroll_steps // unroll_length
Collaborator:

It's strange to call an unroll an "iter"? The original definition is that each training iter has one unroll. So what does "unroll iters" mean in this context?

Contributor Author:

Added comments. One effective_unroll_iter refers to unroll_length calls of rollout_step during the unroll phase.
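A small worked example of the quoted expression, with made-up numbers:

```python
# Hypothetical numbers, purely to illustrate the quoted expression.
unroll_length = 4            # rollout_step calls per unroll phase
effective_unroll_steps = 10  # steps kept by preprocess_unroll_experience

# unroll_length == 0 denotes fractional unroll and counts as one effective iter.
effective_unroll_iters = (
    1 if unroll_length == 0 else effective_unroll_steps // unroll_length)
print(effective_unroll_iters)  # 2 -> train_from_replay_buffer runs twice
```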

@hnyu (Collaborator) commented Jun 6, 2025

> This PR ensures synced training, meaning we won't do train step for those invalid/to be excluded steps.

I think the name "synced training" is somewhat confusing to me. The second half of the sentence doesn't relate to synchronization?

@Haichao-Zhang force-pushed the PR_episodic_annotation branch from 788dfb4 to c837043 on June 6, 2025 19:45
@Haichao-Zhang changed the title from "Post Process Experience and Synced Training" to "Post Process Experience and Corresponding train/unroll Ratio Adjustment" on Jun 6, 2025
@Haichao-Zhang force-pushed the PR_episodic_annotation branch from c837043 to 8fc3ff2 on June 6, 2025 19:49
@Haichao-Zhang changed the title from "Post Process Experience and Corresponding train/unroll Ratio Adjustment" to "Post Process Experience and Customizable Modes" on Jun 6, 2025
@Haichao-Zhang changed the title from "Post Process Experience and Customizable Modes" to "Post Process Experience with Customizable Modes" on Jun 6, 2025
@Haichao-Zhang (Contributor Author) commented:

> > This PR ensures synced training, meaning we won't do train step for those invalid/to be excluded steps.
>
> I think the name "synced training" is somewhat confusing to me. The second half of the sentence doesn't relate to synchronization?

Yeah, this part of the description is outdated. Updated with a new one.
