Potential improvements for evaluation #15

Open
BenjaminBossan opened this issue Jan 10, 2024 · 1 comment
Labels
question Further information is requested

Comments

@BenjaminBossan

Thanks for providing the code for this promising research. I'm looking forward to seeing how far this idea can be pushed. It would be especially cool if a heuristic could be found that applies this technique to multiple layers, chosen in a way that works across different models.

When I investigated the results a bit more closely and ran some of the benchmarks locally, I came across some potential issues. Specifically, I took a look at the BigBench Epistemic Reasoning benchmark, but I suspect that others could also be affected. First of all, I noticed that the accuracy of the models without intervention was below 50% (Tab. 1). For a binary classification task, this is strange. When debugging the results, I found that for RoBERTa and GPT-J (I haven't tested Llama), the models always predicted the same label, and since that label is used in 37% of the samples in the dataset, that is also their accuracy. As Llama has 63% accuracy with intervention, I suspect that it simply always predicts the other label.

Digging a bit deeper, I found the logits for the label tokens to be extremely small. This typically happens when the model is somehow "derailed" and wants to predict neither of the tokens. Sometimes this simply comes down to tokenization: often, the models try to predict " True" and " False" (with a leading whitespace) because that is how they tokenize the text. Other times, they want to go in a completely different direction. I would recommend logging the absolute probabilities of the label tokens and double-checking whenever they are too low. Often, this can be fixed by slight adjustments to the prompts or labels.
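Something along these lines is what I have in mind (just a sketch; the checkpoint name, prompt, and label strings are placeholders, any causal LM from transformers should behave the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def log_label_probs(prompt, labels=("True", "False", " True", " False")):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    for label in labels:
        ids = tokenizer.encode(label, add_special_tokens=False)
        # only the first sub-token of the label is scored as the next-token prediction
        print(f"{label!r}: first-token prob = {probs[ids[0]].item():.6f}")
    # what the model would actually like to predict next
    top_prob, top_id = probs.max(dim=-1)
    print(f"most likely next token: {tokenizer.decode([top_id.item()])!r} ({top_prob.item():.4f})")
```

If both the stripped and the whitespace-prefixed label variants come out with tiny probabilities, the model is most likely heading somewhere else entirely and the prompt needs adjusting.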

Also, there is a typo in this prompt: "entails" => "entail".

I hope this is helpful.

@dkmisra
Collaborator

dkmisra commented Jan 10, 2024

Hi @BenjaminBossan. Thanks for these comments.

Regarding your first comment, we have a stacking result in our paper where we apply these interventions layer by layer in a greedy fashion. See the paragraph on *Composing reductions across layers* in the paper here.

Your second comment sounds very important and interesting. I am logging the predicted labels and log-probs, but I haven't inspected them manually. We have a few binary classification tasks: Fever, Bios Gender, TruthfulQA, and BigBench Epistemic Reasoning. I calculated the class probabilities (a sketch of the computation follows the list):

- Fever: 49.95% / 50.05%
- Bios Gender: 53.88% / 46.12%
- TruthfulQA: 56.23% / 43.77%
- Epistemic Reasoning: 37.12% / 62.88%
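
These percentages are just the relative frequencies of the gold labels, so the larger one is also the accuracy of a constant predictor. A minimal sketch of the computation (with `gold_labels` standing in for whatever the benchmark loader returns):

```python
from collections import Counter

def class_balance(gold_labels):
    """Relative frequency of each gold label."""
    counts = Counter(gold_labels)
    total = len(gold_labels)
    return {label: count / total for label, count in counts.most_common()}

# toy example: a constant predictor scores exactly the majority-class frequency
gold_labels = ["non-entailment"] * 63 + ["entailment"] * 37
print(class_balance(gold_labels))                # {'non-entailment': 0.63, 'entailment': 0.37}
print(max(class_balance(gold_labels).values()))  # 0.63, the best single-class accuracy
```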

For Bios Gender and Fever+Llama2, the gains seem to be non-trivial. For TruthfulQA and BigBench Epistemic Reasoning, all models seem to end up roughly at the best single-class prediction accuracy. LASER does help here, but maybe the corrections amount to just shifting the bias. Let me check whether your hypothesis is correct and get back later today.

Thanks for catching the typo in the prompt. I'll make a note of it.

The issue of whitespace before tokens is also important. I made changes so that GPT-J predicts the token with a leading whitespace and Llama2 doesn't, but I will do another sanity check.
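
A quick way to double-check (the checkpoint names here are the usual Hugging Face ones and only illustrative; the Llama2 one is gated):

```python
from transformers import AutoTokenizer

# compare how each tokenizer splits a label with and without a leading space;
# the first sub-token is what the evaluation ends up scoring
for name in ("EleutherAI/gpt-j-6b", "meta-llama/Llama-2-7b-hf"):
    tok = AutoTokenizer.from_pretrained(name)
    for label in ("True", " True"):
        ids = tok.encode(label, add_special_tokens=False)
        print(f"{name}  {label!r} -> {tok.convert_ids_to_tokens(ids)}")
```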

@dkmisra dkmisra added the question Further information is requested label Jan 10, 2024