I'm trying to reproduce metrics in Table 1 for LLAMA-2. I did so for GPT-J, and the results are consistent; however, for LLAMA-2 for some reason, the results are not matching. Any idea of why this is happening?
For LLAMA-2, Fever, I get:
Baseline (no LASER): 54.98% accuracy (the paper reports 59.3%)
With LASER: 54.13% accuracy
Logs:
Baseline (no LASER):
python intervention_llama2_fever.py --lname dont --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.98242396454226 percentage, Mean F1 score is None, Mean Log Prob is -1.1887680674259296, top-1 accuracy is 54.82958887360538, top-10 accuracy is 99.99235824545316, top-5 accuracy is 99.92358245453156.
With LASER:
python intervention_llama2_fever.py --lname fc_in --rate 8.0 --lnum 30 --home_dir out_data/fever --model_path meta-llama/Llama-2-7b-chat-hf
Main: Msg: Final Performance: Dataset size 13086 0-1 Correctness is 54.13418920984258 percentage, Mean F1 score is None, Mean Log Prob is -1.2900288283587429, top-1 accuracy is 54.09598043710836, top-10 accuracy is 100.0, top-5 accuracy is 99.91594069998472.
Specs:
Python==3.8
Torch==1.12.1+cu116
My guess is that you are using a different Llama2 7B version. See this issue #18
If you want to use this Llama 2 version, then try to run a grid search over LASER values as the optimal values could be different from the ones in the paper.
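A minimal sketch of what such a grid search could look like, assuming the script accepts the same flags shown in the commands in this thread (`--lname`, `--lnum`, `--rate`, `--home_dir`, `--model_path`). The grid values below are hypothetical placeholders, not the settings from the paper, and the helper names `build_commands` and `parse_accuracy` are made up for illustration. The sketch only builds the command lines and parses the "0-1 Correctness" figure out of a "Final Performance" log line; where the script actually writes that line (stdout or a log file under `--home_dir`) is not confirmed here.

```python
import itertools
import re

SCRIPT = "intervention_llama2_fever.py"
MODEL_PATH = "meta-llama/Llama-2-7b-chat-hf"
HOME_DIR = "out_data/fever"

# Hypothetical search grid (assumption): adjust to the layers and rates you want to try.
LNAMES = ["fc_in"]
LNUMS = [26, 28, 30]
RATES = [2.0, 4.0, 6.0, 8.0]


def build_commands(lnames=LNAMES, lnums=LNUMS, rates=RATES):
    """Return one argv list per (lname, lnum, rate) combination."""
    commands = []
    for lname, lnum, rate in itertools.product(lnames, lnums, rates):
        commands.append([
            "python", SCRIPT,
            "--lname", lname,
            "--rate", str(rate),
            "--lnum", str(lnum),
            "--home_dir", HOME_DIR,
            "--model_path", MODEL_PATH,
        ])
    return commands


# Matches the accuracy figure in a "Final Performance" log line like the ones quoted above.
_ACCURACY = re.compile(r"0-1 Correctness is ([0-9.]+) percentage")


def parse_accuracy(log_line):
    """Extract the 0-1 correctness percentage, or None if the line has no such figure."""
    match = _ACCURACY.search(log_line)
    return float(match.group(1)) if match else None
```

Each entry returned by `build_commands()` can be passed to `subprocess.run`, and `parse_accuracy` can then be applied to the resulting "Final Performance" line to compare configurations against the `--lname dont` baseline.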