
Llama2-7B + TruthfulQA reproduce issue #18

Closed
JiwenJ opened this issue Jan 14, 2024 · 8 comments

Labels: question (Further information is requested)

Comments

JiwenJ commented Jan 14, 2024

Hello @pratyushasharma. Thanks for your effort and for releasing the code. I have been reproducing the Llama2-7B + TruthfulQA result with your code so that I can use your work as a baseline for further research, but I found that the accuracies were almost the same, around 56.52, especially for the base model. I do not know what is wrong, and I am still confused about what causes such a large accuracy increase on Llama2-7B + TruthfulQA (around 5.7% in your results). I would appreciate it if you could help me check this result.

dkmisra (Collaborator) commented Jan 14, 2024

Hi @JiwenJ, what command did you run? The base model should be deterministic.

dkmisra (Collaborator) commented Jan 14, 2024

Hi @JiwenJ, I just reproduced the base results reported in the paper. The command I use is:

python3 intervention_llama2_truthfulqa.py --lname dont --model_path <model-path-to-Llama>

Note that lname dont means don't do any intervention. The TruthfulQA dataset I use has 5882 examples; we convert the original multiple-choice QA into a binary true/false question for each (question, answer) pair.
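For context, the conversion described above can be sketched roughly as follows. This is an illustrative helper, not the repository's actual code; the function and argument names are hypothetical.

```python
def to_binary_examples(question, answer_choices, correct_answers):
    """Turn one multiple-choice TruthfulQA question into binary true/false examples."""
    examples = []
    for answer in answer_choices:
        examples.append({
            "question": question,
            "answer": answer,
            # True if this candidate answer is among the correct ones
            "label": answer in correct_answers,
        })
    return examples
```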

Additionally, note that I am using the Llama-2-7b-hf model. Could you by any chance be using a different Llama2 version? Can you tell me whether your Llama2 is quantized or not? I need to check whether the one I am using is.

Finally, as noted in issue #15, the 56.2 accuracy for TruthfulQA is the accuracy of predicting the most common label for all questions. So it is rather odd that the LLMs get accuracy below 56.2, and achieving this accuracy isn't a big deal. We will try to evaluate LASER on more powerful LLMs to see if we can get beyond this value.
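For context, the 56.2 figure is simply the majority-class baseline. A small illustrative script (not part of the repository) that computes it:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)

# Illustrative: if roughly 3306 of the 5882 binary examples carry the
# majority label, the baseline accuracy is about 3306 / 5882 ~= 0.562.
```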

dkmisra added the "bug (Something isn't working)" label on Jan 14, 2024
dkmisra self-assigned this on Jan 14, 2024
dkmisra added the "question (Further information is requested)" label and removed the "bug (Something isn't working)" label on Jan 14, 2024
JiwenJ (Author) commented Jan 15, 2024

Thank you for your quick response. I use the dont parameter and dataset_size == 5882. My original Llama-2-7b-hf uses the bf16 dtype. When I reproduce, I use half precision, i.e., model = LlamaForCausalLM.from_pretrained(llm_path).half(). I will check whether my model is an official one.
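For context, .half() casts the weights to fp16, whereas the Llama-2-7b-hf checkpoint is stored in bf16; the two formats have different exponent ranges, so logits can differ slightly. A short illustrative sketch of the loading options (llm_path is a placeholder; torch_dtype is the standard transformers argument; only one of these loads would be used in practice):

```python
import torch
from transformers import LlamaForCausalLM

llm_path = "meta-llama/Llama-2-7b-hf"  # placeholder path

# .half() casts the bf16 checkpoint to fp16 (smaller exponent range).
model_fp16 = LlamaForCausalLM.from_pretrained(llm_path).half()

# Loading directly in bf16 keeps the checkpoint's native dtype.
model_bf16 = LlamaForCausalLM.from_pretrained(llm_path, torch_dtype=torch.bfloat16)

# The maintainer's reproduction below uses the default full-precision (fp32) load.
model_fp32 = LlamaForCausalLM.from_pretrained(llm_path)
```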

dkmisra (Collaborator) commented Jan 15, 2024

I am not doing this:

When I reproduce, I use half precision, i.e., model = LlamaForCausalLM.from_pretrained(llm_path).half(). I will check whether my model is an official one.

so my guess is that this is the main point of difference. Can you try disabling this line and running again? Please see how we use the code here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_truthfulqa.py

I am not able to share the model directly, but if you tell me the probabilities of some words in a context, I can double-check against what I have to see whether our models are the same.

On a more research-y note, I wouldn't be surprised if a quantized LLM has some flavor of LASER, since you are also doing compression. So if quantized Llama performs better than normal Llama, that would be cool! Anyway, let me know what you find.

JiwenJ (Author) commented Jan 15, 2024

OK, I will do some comparative analysis and get the model from Meta to reproduce again. I will post here if I find something.

dkmisra (Collaborator) commented Jan 15, 2024

Sounds good. Maybe you just need to replace

model = LlamaForCausalLM.from_pretrained(llm_path).half()

with

model = LlamaForCausalLM.from_pretrained(llm_path)

JiwenJ (Author) commented Jan 18, 2024

Sorry for my delayed response. I know about model = LlamaForCausalLM.from_pretrained(llm_path), but I was stuck at model.to(self.device), since an RTX 4090 cannot hold Llama2-7B in full precision on a single GPU, and model.to(self.device) does not support loading across multiple GPUs. Anyway, I downloaded the model from Meta and reproduced your result. Thank you for the help you've provided.
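For anyone who hits the same single-GPU memory limit, one standard workaround (not used in this thread) is to let transformers shard the full-precision model across the available GPUs instead of calling model.to(device). Sketch with llm_path as a placeholder:

```python
from transformers import LlamaForCausalLM

llm_path = "meta-llama/Llama-2-7b-hf"  # placeholder path

# device_map="auto" (requires the accelerate package) spreads the layers
# across all visible GPUs, so no explicit model.to(device) call is needed.
model = LlamaForCausalLM.from_pretrained(llm_path, device_map="auto")
```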

dkmisra (Collaborator) commented Jan 19, 2024

Glad to hear! I'll close this issue. Feel free to reopen it if a follow-up question comes up. All the best.
