Llama2-7B + TruthfulQA reproduction issue #18
Hello~ @pratyushasharma. Thanks for your effort and the code. I have been reproducing the Llama2-7B + TruthfulQA result based on your code so that I can use your work as a baseline for further research, but I found that the accuracies were almost identical, around 56.52, especially for the base model. I do not know what is wrong, and I am still confused about what causes such a large accuracy increase on Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.

Comments
Hi @JiwenJ, what command did you run? The base model should be deterministic.
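For context on why the base model is deterministic: evaluating a fixed checkpoint by scoring candidate answers involves only forward passes, no sampling. A minimal sketch, assuming the standard HuggingFace transformers API; the model id, prompt, and candidate answer below are placeholders, not taken from the repo's script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; access to the Llama-2 weights is gated by Meta.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).eval()

question = "What happens if you crack your knuckles a lot?"  # placeholder prompt
answer = " Nothing in particular happens."                   # placeholder candidate
ids = tok(question + answer, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Log-probability of each token given its prefix (shift targets by one).
logp = torch.log_softmax(logits[:, :-1], dim=-1)
token_logp = logp.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Pure forward passes, no sampling: this score is identical on every run.
print(token_logp.sum().item())
```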
Hi @JiwenJ, I just reproduced the base results reported in the paper. The code I use is: […]
Note that […]. Additionally, note that I am using […]. Finally, as noted in issue #15, the 56.2 accuracy for TruthfulQA is the accuracy of predicting the most common label for all questions. So it is rather odd that the LLM gets accuracy below 56.2, and reaching 56.2 itself isn't a big deal. We will try to evaluate LASER on more powerful LLMs to see if we can get beyond this value.
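To make that baseline concrete: a constant predictor that always outputs the most common label scores exactly that label's frequency. A toy sketch; the counts below are illustrative, not the actual TruthfulQA statistics:

```python
# If 56.2% of examples carry the majority label, predicting that label
# for every question scores 56.2% accuracy without any model at all.
labels = [1] * 562 + [0] * 438  # illustrative counts, not the real dataset
majority = max(set(labels), key=labels.count)
accuracy = sum(1 for y in labels if y == majority) / len(labels)
print(accuracy)  # 0.562
```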
Thank you for your quick response. I use the […]
I am not doing this: […] so my guess is that this is the main point of difference. Can you try disabling this line and trying again? Please see how we use the code here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_truthfulqa.py

I am not able to share the model directly, but if you tell me the probabilities of some words in a context, I can compare them with what I have to check whether our models are the same.

On a more researchy note, I wouldn't be surprised if a quantized LLM has some flavor of LASER, since quantization is also a form of compression. So if quantized Llama performs better than normal Llama, that would be cool! Anyway, let me know what you find.
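The suggested probability check takes only a few lines. A sketch, again assuming the HuggingFace transformers API; the context and words below are placeholders to be swapped for whatever the two parties agree on:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder id
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

context = "The capital of France is"  # placeholder context
input_ids = tok(context, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits

# Distribution over the next token after the context.
probs = torch.softmax(logits[0, -1].float(), dim=-1)
for word in [" Paris", " London"]:  # placeholder words to compare
    first_id = tok(word, add_special_tokens=False).input_ids[0]
    # For multi-token words this reports only the first subword's probability.
    print(repr(word), probs[first_id].item())
```

If two checkpoints report the same probabilities here, they almost certainly hold the same weights; a mismatch would point to a difference such as a quantized download.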
OK, I will do some comparative analysis and get the model from Meta to try reproducing again. I will post here if I find something.
Sounds good. Maybe you just need to replace […] by […].
Sorry for my delayed response. I know […]
Glad to hear! I'll close this issue. Feel free to reopen it if a follow-up question comes up. All the best.