Post Process Experience with Customizable Modes #1768
base: pytorch
Conversation
alf/algorithms/algorithm.py
Outdated
@common.mark_replay
def train_from_replay_buffer(self, update_global_counter=False):
def train_from_replay_buffer(self,
                             effective_unroll_steps,
Add docstring for this arg?
This arg is now removed
alf/algorithms/rl_algorithm.py
Outdated
config, self._num_earliest_frames_ignored)

if self._episodic_annotation:
    assert self._env.batch_size == 1, "only support non-batched environment"
Add this to the docstring of episodic_annotation?
The assertion is not necessary here, so it is removed as well.
alf/algorithms/rl_algorithm.py
Outdated
| """A function that determines whether the ``post_process_episode`` function should | ||
| be applied to the current list of experiences. | ||
| """ |
This is an interface mainly used for subclasses? Maybe mention this. Same for post_process_episode.
Good point. Added comments, also for post_process_episode.
alf/algorithms/rl_algorithm.py
Outdated
self._cached_exp)
effective_number_of_unroll_steps = len(annotated_exp_list)
# 2) observe
if not self.on_policy:
Maybe this condition check should be performed earlier, since it seems a waste to do all the post_process_episode if self.on_policy?
Updated
alf/algorithms/algorithm.py
Outdated
< config.initial_collect_steps) or (effective_unroll_steps
        == 0):
Is there any situation that train_from_replay_buffer will be called with effective_unroll_steps=0?
effective_unroll_steps is now removed from this function
alf/algorithms/rl_algorithm.py
Outdated
for i in range(effective_unroll_steps):
    steps += self.train_from_replay_buffer(effective_unroll_steps=1,
                                            update_global_counter=True)
if unrolled:
    with record_time("time/after_train_iter"):
        self.after_train_iter(root_inputs, rollout_info)
I feel that this update fundamentally changes the off-policy update logic w.r.t. its actual unroll in the env. Previously, between every call of self._unroll_iter_off_policy, the policy gets an "update" from self.train_from_replay_buffer. Now, if self._episodic_annotation, policy training only happens after each episode, though the UTD stays the same. I feel that the episodic annotation function should be configurable independently of the choice of such unroll/update logic. Ideally, we may want to keep the previous version here while achieving the same effect as the change in the above lines by configuring unroll_length and num_updates_per_train_iter.
If self._episodic_annotation is False, everything is the same as before.
If self._episodic_annotation is True, by default (with the new commit) it also reduces to the original logic, so everything is the same as before (policy training happens after each time step, not after each episode).
In the derived class, it is up to the user to determine what kind of annotation function they want to implement and use.
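To illustrate the "reduces to the original logic by default" point, here is a minimal sketch; the hook name and signature are assumptions for illustration only, not the exact code in the diff:

```python
def annotation_hook(self, rollout_info, step_type, exp):
    # Default behavior: hand the current experience straight back. It is then
    # observed (stored for replay) immediately, so off-policy training still
    # happens after every time step, exactly as before this PR.
    return [exp]
```

An episodic variant would instead cache experiences and only return them at episode end, which is the derived class's choice.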
alf/algorithms/rl_algorithm.py
Outdated
alf.layers.to_float32(policy_state))
effective_number_of_unroll_steps = 1
if self._episodic_annotation:
    assert not self.on_policy, "only support episodic annotation for off policy training"
Maybe assert this in the __init__ function?
Thank you Haichao for addressing all my comments. Just one more minor question.
le-horizon left a comment
Some high- and low-level comments, if they make sense.
alf/algorithms/rl_algorithm.py
Outdated
self.observe_for_replay(exp)
store_exp_time = time.time() - t0
# clean up the exp cache
self._cached_exp = []
This seems to assume that all envs end on the same step? What if some envs are LAST, some are MID? cached_exp will be cleared even for those with MID steps?
Even when doing this for an env with batch_size 1, this annotation mode will delay experience from being stored into the replay buffer.
Ok to submit the change as is, but may need to do two things:
- rename the feature to something like store_experience_on_episode_end, and document its behavior clearly in the docstr. Experience relabel should be done when reading data out of replay buffer as in hindsight relabel.
- assert that batch_size is 1 when enabled.
Also, delaying train_step because of delayed experience storage can have unexpected side effects, e.g. if episodes are 100 steps long, and unroll once per train iter, then summary will only happen every 100 train iters. It will also shift the distribution of the data training sees due to the delay.
Overall I think doing this episode level relabeling at the DataTransformer stage, after reading from replay_buffer is perhaps a better way, and a cleaner way as well (less scattered code). That would require the replay buffer to keep track of episode begin and end, which I think it already does.
> This seems to assume that all envs end on the same step? What if some envs are LAST, some are MID? cached_exp will be cleared even for those with MID steps?

There is no such assumption. It is totally up to the users to inject their own assumptions. By default, the behavior is the same as before. Sorry that the function names are a bit misleading and their role has been extended to handle the per-step case as well. Changed the function names and added more comments.

> Even when doing this for an env with batch_size 1, this annotation mode will delay experience from being stored into the replay buffer.

No it won't. By default, the behavior is the same as before.

> Ok to submit the change as is, but may need to do two things: rename the feature to something like store_experience_on_episode_end, and document its behavior clearly in the docstr.

The suggested name is not appropriate.

> Experience relabel should be done when reading data out of replay buffer as in hindsight relabel.

Different use cases. This is an alternative interface that can support more than pure relabeling (e.g. excluding data), which is not directly supported by the replay buffer hindsight relabel.

> Assert that batch_size is 1 when enabled.

There is no such assumption in the current PR. It is up to the user.

> Also, delaying train_step because of delayed experience storage can have unexpected side effects, e.g. if episodes are 100 steps long, and unroll once per train iter, then summary will only happen every 100 train iters. It will also shift the distribution of the data training sees due to the delay.

There is no delay.

> Overall I think doing this episode level relabeling at the DataTransformer stage, after reading from replay_buffer is perhaps a better way, and a cleaner way as well (less scattered code). That would require the replay buffer to keep track of episode begin and end, which I think it already does.

As explained, it is more than pure relabeling.
alf/algorithms/rl_algorithm.py
Outdated
with record_time("time/after_train_iter"):
    self.after_train_iter(root_inputs, rollout_info)
steps = 0
for i in range(effective_unroll_steps):
unroll_steps is the wrong name? It should be called unroll_iterations to indicate training iterations, not env steps?
Also rename effective_number_of_unroll_steps to effective_unroll_iters to be consistent (i.e. remove "number_of_").
Thanks for the comments. Changed.
alf/algorithms/algorithm.py
Outdated
experience, batch_info = self._replay_buffer.gather_all(
    ignore_earliest_frames=True)
num_updates = config.num_updates_per_train_iter
num_updates = effective_num_updates_per_train_iter
Why do you need to make this change?
Not necessary anymore; removed.
alf/algorithms/rl_algorithm.py
Outdated
effective_unroll_iters = effective_unroll_steps // unroll_length
return experience, effective_unroll_iters

def should_post_process_experience(self, rollout_info,
This is unnecessary. We can always call post_process_experience
Correct. Removed this function.
alf/algorithms/rl_algorithm.py
Outdated
As another example, task filtering can be simply achieved by returning ``[]``
in ``post_process_experience`` for that particular task.
- per-episode processing: ``should_post_process_experience`` returns True on episode
end and ``post_process_experience`` can return a list of cached and processed
no need to mention "cached". It will confuse the user.
Updated the docstring.
Force-pushed 45c8321 to 9cfe6a5
Force-pushed d89cb3e to 26ab09a
with record_time("time/after_train_iter"):
    self.after_train_iter(root_inputs, rollout_info)
steps = 0
for i in range(effective_unroll_iters):
It's possible that effective_unroll_iters is always smaller than 1 in the case of num_envs > 1.
Good point. Now also handles the fractional unroll case.
alf/algorithms/rl_algorithm.py
Outdated
# 1) process
post_processed_exp_list = self.post_process_experience(
    rollout_info, time_step.step_type, exp)
effective_unroll_steps = len(post_processed_exp_list)
In order to really get this right, we need to do: sum(exp.step_type.shape[0] for exp in post_processed_exp_list) / exp.step_type.shape[0]
Updated.
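For concreteness, here is a small sketch of the batch-aware counting suggested in the comment above; the `Exp` container is a stand-in for illustration, not ALF's actual experience type:

```python
import torch
from typing import List, NamedTuple

class Exp(NamedTuple):
    # Stand-in for an experience whose step_type has shape [batch_size].
    step_type: torch.Tensor

def effective_unroll_steps(post_processed_exp_list: List[Exp], batch_size: int) -> float:
    # Count the per-env transitions kept by the post-processing, then normalize
    # by the env batch size, so one full batched step counts as one unroll step.
    kept = sum(e.step_type.shape[0] for e in post_processed_exp_list)
    return kept / batch_size

# Keeping 3 batched experiences from 2 parallel envs keeps 6 per-env
# transitions, i.e. 3.0 effective unroll steps.
exps = [Exp(torch.zeros(2, dtype=torch.int32)) for _ in range(3)]
assert effective_unroll_steps(exps, batch_size=2) == 3.0
```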
Closed for now since this is mostly used in a side project. Will reopen if this becomes a general feature for other use cases.
Reopened, as it seems other people are also trying to use this feature.
Pushed a commit 94a50bf to give the user the flexibility to customize.
alf/algorithms/rl_algorithm.py
Outdated
effective_unroll_iters = 1 if unroll_length == 0 else effective_unroll_steps // unroll_length
return experience, effective_unroll_iters

def post_process_experience(self, rollout_info, step_type: StepType,
This name is confusing given the existing function preprocess_experience: it might suggest that this happens after that, but in fact this happens before training.
Changed to preprocess_unroll_experience.
store_exp_time = 0.
step_time = 0.
max_step_time = 0.
effective_unroll_steps = 0
I think we lack a formal definition of "effective" in the code document.
Added more comments with examples, especially in preprocess_unroll_experience
return experience
# if the input unroll_length is 0 (e.g. fractional unroll), then this is treated as
# an effective unroll iter
effective_unroll_iters = 1 if unroll_length == 0 else effective_unroll_steps // unroll_length
It's strange to call an unroll an "iter"? The original definition is that in each training iter we have one unroll. So what does unroll iters mean in this context?
Added comments. One effective_unroll_iter refers to calling rollout_step unroll_length times in the unroll phase.
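A minimal illustration of the fractional-unroll handling shown in the diff above; this mirrors the reviewed line rather than reproducing the full ALF logic:

```python
def effective_unroll_iters(effective_unroll_steps: int, unroll_length: int) -> int:
    # unroll_length == 0 denotes fractional unroll (less than one env step per
    # training iteration); that case is treated as one effective unroll iter.
    if unroll_length == 0:
        return 1
    # One effective unroll iter corresponds to unroll_length calls of
    # rollout_step during the unroll phase.
    return effective_unroll_steps // unroll_length

assert effective_unroll_iters(10, 5) == 2   # two full unrolls' worth of kept steps
assert effective_unroll_iters(3, 5) == 0    # fewer kept steps than one unroll
assert effective_unroll_iters(7, 0) == 1    # fractional unroll
```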
I think the name "synced training" is somewhat confusing. The second half of the sentence doesn't relate to synchronization?
Force-pushed 788dfb4 to c837043
Force-pushed c837043 to 8fc3ff2
Yeah, this part of the description is outdated. Updated with a new one.
Post Process Experience
This PR adds the preprocess_unroll_experience feature, which is useful in a number of scenarios; see the examples under Customizable Modes below.
In some cases, it is also a necessary component if we want to use in-algorithm procedures to overwrite quantities in the time step (e.g. the step type) and record them into the buffer, so that all the other logic in ALF is fully respected (e.g. masking out the loss for the LAST step based on step type).
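As a hedged illustration of overwriting the step type before the experience is stored (the time_step field, the _replace call, and the termination check are assumptions about the experience structure, not the exact ALF code):

```python
import torch
from alf.data_structures import StepType

def preprocess_unroll_experience(self, rollout_info, step_type, exp):
    # Hypothetical: if the algorithm decides the episode should end here (e.g.
    # a detected failure), stamp the step as LAST before it is recorded, so
    # downstream ALF logic (such as masking the loss at LAST steps) applies.
    if self._should_terminate_here(exp):  # user-defined check, illustrative only
        time_step = exp.time_step._replace(
            step_type=torch.full_like(exp.time_step.step_type, int(StepType.LAST)))
        exp = exp._replace(time_step=time_step)
    return [exp]
```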
Synced Training
Note that although one may think step-based filtering (e.g. excluding tasks) can also be done on the replay buffer side, the training dynamics are not the same.
This PR ensures synced training, meaning we won't do a train step for invalid / to-be-excluded steps.
In contrast, replay buffer based filtering cannot ensure synced training.
Customizable Modes
The behavior can be customized by the user. Some examples (a code sketch illustrating these modes follows the list):
(1) per-step saving without delay: saving each step of unroll experience into the replay buffer as we get it.
(2) all-step saving with delay: saving all the steps of unroll experience into the replay buffer with a delay. This can happen when we want to annotate a trajectory based on quantities that are not immediately available at the current step (e.g. task success/failure).
(3) selective saving: exclude some of the unroll experiences and only save the rest. This could be useful in the case where there are transitions that are irrelevant to the training (e.g. in the multi-task case, where we want to exclude data from certain subtasks).
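To make the modes concrete, here is a hedged sketch of a derived algorithm customizing the hook; the base class name, the task-exclusion predicate, and the annotation helper are hypothetical, and the episodic caching assumes the non-batched case (batch_size == 1):

```python
from alf.data_structures import StepType

class AnnotatingAlgorithm(MyOffPolicyAlgorithm):  # hypothetical class names
    def preprocess_unroll_experience(self, rollout_info, step_type, exp):
        # Mode (3), selective saving: drop transitions from excluded subtasks.
        # `_is_excluded_task` is a hypothetical user-defined predicate.
        if self._is_excluded_task(rollout_info):
            return []
        # Mode (2), all-step saving with delay: cache the episode and only
        # release it, annotated, once the LAST step arrives.
        self._cached_exp.append(exp)
        if step_type == StepType.LAST:  # assumes batch_size == 1
            episode = self._annotate_episode(self._cached_exp)  # e.g. mark success/failure
            self._cached_exp = []
            return episode
        return []
```

Mode (1), per-step saving without delay, is simply `return [exp]`, which matches the default behavior.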