@QZH-777 @Xiao9905 Hi Zehan and Xiao, thanks for the great work! I have some questions about how rollouts on the SFT training set are evaluated with the WebArena-Lite reward functions. As discussed in issue-24, $D_0$ is provided in WebArena-Lite_info.json; however, there are no reference answers or built-in WebArena-Lite reward functions associated with $D_0$ for evaluation.
Am I missing something here? If not, can you explain a bit about how the evaluation process described in Line 3 of Algorithm 1 was done?
Thanks for the response! However, I'm still a bit confused about the implication here: are you suggesting that you were using the ORM to evaluate $D_{rollout}$ in Line 3, instead of using the WebArena-Lite reward functions?
In that case, I think this would be inconsistent with what's described in Algorithm 1?
The training data for the ORM consists of rollouts from WebArena-Lite tasks, with success determined by the corresponding reward functions. The ORM can potentially serve as an alternative to the reward function. For certain reasons, we do not plan to publicly release the WebArena-Lite data along with the reward function at this time.
Thank you for the clarification and for being transparent about the situation!
Looking forward to the potential future release of these resources, which would greatly help the community accurately reproduce the training process of WebRL and contribute to further extensions.