
Llama2-7B + TruthfulQA reproduce issue #18

Closed
JiwenJ opened this issue Jan 14, 2024 · 8 comments

Labels: question (Further information is requested)

Comments

JiwenJ commented Jan 14, 2024

Hello @pratyushasharma. Thanks for your effort and for releasing the code. I have been reproducing the Llama2-7B + TruthfulQA result with your code so that I can use your work as a baseline for further research, but I found that the accuracies were almost the same, around 56.52, especially for the base model. I do not know what is wrong, and I am still confused about what causes such a large accuracy increase on Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.

dkmisra (Collaborator) commented Jan 14, 2024

Hi @JiwenJ, what command did you run? The base model should be deterministic.

dkmisra (Collaborator) commented Jan 14, 2024

Hi @JiwenJ, I just reproduced the base results reported in the paper. The command I use is:

python3 intervention_llama2_truthfulqa.py --lname dont --model_path <model-path-to-Llama>

Note that lname dont means don't do any intervention. The TruthfulQA dataset I use has 5882 examples; we convert the original multiple-choice QA into a binary true/false question for each (question, answer) pair.
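For context, the conversion described above can be sketched roughly as follows. This is an illustrative helper, not the repository's actual code; the function and argument names are hypothetical.

```python
def to_binary_examples(question, answer_choices, correct_answers):
    """Turn one multiple-choice TruthfulQA question into binary true/false examples."""
    examples = []
    for answer in answer_choices:
        examples.append({
            "question": question,
            "answer": answer,
            # True if this candidate answer is among the correct ones
            "label": answer in correct_answers,
        })
    return examples
```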

Additionally, note that I am using the Llama-2-7b-hf model. Could you by any chance be using a different Llama2 version? Can you tell me whether your Llama2 is quantized or not? I need to check whether the one I am using is.

Finally, as noted in issue #15, the 56.2 accuracy for TruthfulQA is the accuracy of predicting the most common label for all questions. So it is rather odd that the LLMs get accuracy below 56.2, and achieving this accuracy isn't a big deal. We will try to evaluate LASER on more powerful LLMs to see if we can get beyond this value.
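For context, the 56.2 figure is simply the majority-class baseline. A small illustrative script (not part of the repository) that computes it:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)

# Illustrative: if roughly 3306 of the 5882 binary examples carry the
# majority label, the baseline accuracy is about 3306 / 5882 ~= 0.562.
```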

dkmisra added the "bug (Something isn't working)" label on Jan 14, 2024
dkmisra self-assigned this on Jan 14, 2024
dkmisra added the "question (Further information is requested)" label and removed the "bug (Something isn't working)" label on Jan 14, 2024
JiwenJ (Author) commented Jan 15, 2024

Thank you for your quick response. I use the dont parameter and dataset_size == 5882. My original Llama-2-7b-hf uses the bf16 dtype. When I reproduce, I use half precision, i.e., model = LlamaForCausalLM.from_pretrained(llm_path).half(). I will check whether my model is an official one.
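For context, .half() casts the weights to fp16, whereas the Llama-2-7b-hf checkpoint is stored in bf16; the two formats have different exponent ranges, so logits can differ slightly. A short illustrative sketch of the loading options (llm_path is a placeholder; torch_dtype is the standard transformers argument; only one of these loads would be used in practice):

```python
import torch
from transformers import LlamaForCausalLM

llm_path = "meta-llama/Llama-2-7b-hf"  # placeholder path

# .half() casts the bf16 checkpoint to fp16 (smaller exponent range).
model_fp16 = LlamaForCausalLM.from_pretrained(llm_path).half()

# Loading directly in bf16 keeps the checkpoint's native dtype.
model_bf16 = LlamaForCausalLM.from_pretrained(llm_path, torch_dtype=torch.bfloat16)

# The maintainer's reproduction below uses the default full-precision (fp32) load.
model_fp32 = LlamaForCausalLM.from_pretrained(llm_path)
```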

dkmisra (Collaborator) commented Jan 15, 2024

I am not doing this:

When I reproduce, I use half precision, i.e., model = LlamaForCausalLM.from_pretrained(llm_path).half(). I will check whether my model is an official one.

so my guess is that this is the main point of difference. Can you try disabling this line and running again? Please see how we use the code here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_truthfulqa.py

I am not able to share the model directly, but if you tell me the probabilities of some words in a context, I can double-check against what I have to see whether our models are the same.

On a more research-y note, I wouldn't be surprised if a quantized LLM has some flavor of LASER, since you are also doing compression. So if quantized Llama performs better than normal Llama, that would be cool! Anyway, let me know what you find.

JiwenJ (Author) commented Jan 15, 2024

OK, I will do some comparative analysis and get the model from Meta to reproduce again. I will post here if I find something.

dkmisra (Collaborator) commented Jan 15, 2024

Sounds good. Maybe you just need to replace

model = LlamaForCausalLM.from_pretrained(llm_path).half()

with

model = LlamaForCausalLM.from_pretrained(llm_path)

JiwenJ (Author) commented Jan 18, 2024

Sorry for my delayed response. I know about model = LlamaForCausalLM.from_pretrained(llm_path), but I was stuck at model.to(self.device), since an RTX 4090 cannot hold Llama2-7B in full precision on a single GPU, and model.to(self.device) does not support loading across multiple GPUs. Anyway, I downloaded the model from Meta and reproduced your result. Thank you for the help you've provided.
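For anyone who hits the same single-GPU memory limit, one standard workaround (not used in this thread) is to let transformers shard the full-precision model across the available GPUs instead of calling model.to(device). Sketch with llm_path as a placeholder:

```python
from transformers import LlamaForCausalLM

llm_path = "meta-llama/Llama-2-7b-hf"  # placeholder path

# device_map="auto" (requires the accelerate package) spreads the layers
# across all visible GPUs, so no explicit model.to(device) call is needed.
model = LlamaForCausalLM.from_pretrained(llm_path, device_map="auto")
```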

dkmisra (Collaborator) commented Jan 19, 2024

Glad to hear! I'll close this issue. Feel free to reopen it if a follow-up question comes up. All the best.
