
Could you release the weights of PRM? #4

Open
cybisolated opened this issue Sep 30, 2024 · 7 comments
Labels
about dataset (datasets of PRM and policy model), about PRM

Comments

@cybisolated

Thanks for your contribution! Could you release the weights of the PRM? Or is there something I missed?

@zhangdan0602
Collaborator

We have illustrated how to train the PRM. Specifically, you can download $D_{V_0}$, put it in PRM/data, and train Mistral-7B as the initial process reward model to obtain VALUE_MODEL_STATE_DICT.
We also provide the training scripts PRM/train_VM_chatglm.py and PRM/train_VM_mistral.py.

@ImKeTT

ImKeTT commented Nov 14, 2024

Great work, but is it possible to just release the model weight?

@jingjingchengcai

jingjingchengcai commented Dec 14, 2024

I trained with PRM/train_VM_chatglm.py, following the instructions, for 2 epochs. The accuracy I got is 0.1614. Is this expected? How many epochs should we use?

@jingjingchengcai

I also trained with PRM/train_VM_mistral.py; the accuracy is 0.1530 after two epochs.
I found one epoch actually gives better accuracy, i.e., 0.2247.
Without training, the accuracy is about 0.1182.
Did anyone get similar results?

@jingjingchengcai

Thank you for your contributions! I’m currently stuck with training the VM.

Below are the statistics of the training data, showing each label and its corresponding number of samples:

Counter({'0.0': 240594, '1.0': 48953, '0.1': 20901, '0.8': 20614, '0.5': 18341, '0.2': 17034, '0.3': 16688, '0.7': 14310, '0.6': 13104, '0.4': 9448, '0.9': 6462})

If the RM predicted only the label 0.0 regardless of the input text, its accuracy would be 240,594 / 426,449 ≈ 0.56. Surprisingly, this is significantly higher than the accuracy achieved by fine-tuning the RM. I hope the authors can help me identify what I might be doing wrong.
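The majority-class baseline above can be checked directly from the reported label distribution. A minimal sketch (the counts are copied from the Counter output above; nothing else is assumed):

```python
from collections import Counter

# Label distribution reported above (value label -> number of samples)
label_counts = Counter({'0.0': 240594, '1.0': 48953, '0.1': 20901,
                        '0.8': 20614, '0.5': 18341, '0.2': 17034,
                        '0.3': 16688, '0.7': 14310, '0.6': 13104,
                        '0.4': 9448, '0.9': 6462})

total = sum(label_counts.values())                     # 426,449 samples
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_acc = majority_count / total                  # always predict '0.0'

print(f"majority label: {majority_label}, baseline accuracy: {baseline_acc:.4f}")
# baseline accuracy ≈ 0.5642
```

A trained model scoring well below this constant-prediction baseline usually points to a label-mapping or evaluation-metric mismatch rather than a modeling problem.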

@zhangdan0602 added the about dataset and about PRM labels on Dec 25, 2024
@zhangdan0602
Collaborator

The experimental settings are as follows:

For ChatGLM3-6B, the learning rate is 2e-5, the number of epochs is 2 or 3, and the batch size is 3.

For Mistral-7B, the learning rate is 3e-6, the number of epochs is 2 or 3, and the batch size is 3.
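The settings above can be collected into a small lookup table. This is only an illustrative sketch: the dictionary and function names are hypothetical and are not taken from PRM/train_VM_chatglm.py or PRM/train_VM_mistral.py; the values are the ones reported in this comment.

```python
# Reported PRM training settings; names here are illustrative, not from the repo.
PRM_TRAIN_CONFIGS = {
    "chatglm3-6b": {"learning_rate": 2e-5, "num_epochs": 2, "batch_size": 3},  # 2 or 3 epochs
    "mistral-7b":  {"learning_rate": 3e-6, "num_epochs": 2, "batch_size": 3},  # 2 or 3 epochs
}

def get_config(model_name: str) -> dict:
    """Look up the reported training settings for a base model."""
    return PRM_TRAIN_CONFIGS[model_name.lower()]
```

For example, `get_config("Mistral-7B")` returns the Mistral settings with a learning rate of 3e-6.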

@debajoycs98

Can anyone let me know approximately how long it takes to run 2 epochs for Mistral on an A100? It shows around 35 hours for me!
