
Reproducing LLAMA-2 metrics #27

Closed
sidhantls opened this issue May 20, 2024 · 2 comments

Comments

@sidhantls (Contributor) commented May 20, 2024

Hello,

I'm trying to reproduce the metrics in Table 1 for LLAMA-2. I did so for GPT-J and the results are consistent; however, for LLAMA-2 the results do not match for some reason. Any idea why this is happening?

For LLAMA-2 on FEVER, I get:

  1. Baseline (no laser): 54.98% accuracy. The paper shows 59.3%
  2. With LASER: 54.13% accuracy

Logs:

  1. Baseline (no laser): 54.98% accuracy. The paper shows 59.3%
    python intervention_llama2_fever.py --lname dont --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf

Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.98242396454226 percentage, Mean F1 score is None, Mean Log Prob is -1.1887680674259296, top-1 accuracy is 54.82958887360538, top-10 accuracy is 99.99235824545316, top-5 accuracy is 99.92358245453156.

  2. With LASER: 54.13% accuracy
    python intervention_llama2_fever.py --lname fc_in --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf

Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.13418920984258 percentage, Mean F1 score is None, Mean Log Prob is -1.2900288283587429, top-1 accuracy is 54.09598043710836, top-10 accuracy is 100.0, top-5 accuracy is 99.91594069998472.

Specs:
Python == 3.8
Torch == 1.12.1+cu116

@dkmisra (Collaborator) commented May 20, 2024

My guess is that you are using a different Llama-2 7B version. See issue #18.

If you want to use this Llama 2 version, try running a grid search over the LASER hyperparameters, as the optimal values could differ from the ones reported in the paper.
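As a rough sketch (not taken from the paper or repo), such a sweep could be scripted as below using the same intervention_llama2_fever.py command shown earlier; the particular --lname, --lnum, and --rate values here are placeholder assumptions and should be replaced with whatever ranges you actually want to search:

    # Hypothetical grid search over LASER hyperparameters.
    # The value lists below are illustrative, not the ranges from the paper.
    for lname in fc_in fc_out; do
      for lnum in 26 28 30; do
        for rate in 2.0 4.0 8.0; do
          python intervention_llama2_fever.py \
            --lname "$lname" \
            --rate "$rate" \
            --lnum "$lnum" \
            --home_dir out_data/fever_grid \
            --model_path meta-llama/Llama-2-7b-chat-hf
        done
      done
    done

Each run logs the final accuracy, so you can pick the combination with the best 0-1 Correctness afterwards.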

@sidhantls (Contributor, Author) commented

Thank you. Using meta-llama/Llama-2-7b-hf, the results are now consistent for FEVER: 59.131 (baseline) and 65.558 (LASER).
