Potential improvements for evaluation #15

Open
BenjaminBossan opened this issue Jan 10, 2024 · 1 comment
Labels
question Further information is requested

Comments

@BenjaminBossan

Thanks for providing the code for this promising research. I'm looking forward to seeing how far this idea can be pushed. It would be especially cool if a heuristic could be found that applies this technique to multiple layers, chosen in a way that works across different models.

When I investigated the results a bit more closely and ran some of the benchmarks locally, I came across some potential issues. Specifically, I took a look at the BigBench Epistemic Reasoning benchmark, but I suspect that others could also be affected. First of all, I noticed that the accuracy of the models without intervention was below 50% (Tab. 1). For a binary classification task, this is strange. When debugging the results, I found that for RoBERTa and GPT-J (I haven't tested Llama), the models always predicted the same label, and since that label is used in 37% of the samples in the dataset, that is also their accuracy. As Llama has 63% accuracy with intervention, I suspect that it simply always predicts the other label.

Digging a bit deeper, I found the logits for the label tokens to be extremely small. This typically happens when the model is somehow "derailed" and wants to predict neither of the tokens. Sometimes this simply comes down to tokenization: often, the models try to predict " True" and " False" (with a leading whitespace) because that is how they tokenize the text. Other times, they want to go in a completely different direction. I would recommend logging the absolute probabilities of the label tokens and double-checking whenever they are too low. Often, this can be fixed by slight adjustments to the prompts or labels.
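Something along these lines is what I have in mind (just a sketch; the checkpoint name, prompt, and label strings are placeholders, any causal LM from transformers should behave the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def log_label_probs(prompt, labels=("True", "False", " True", " False")):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    for label in labels:
        ids = tokenizer.encode(label, add_special_tokens=False)
        # only the first sub-token of the label is scored as the next-token prediction
        print(f"{label!r}: first-token prob = {probs[ids[0]].item():.6f}")
    # what the model would actually like to predict next
    top_prob, top_id = probs.max(dim=-1)
    print(f"most likely next token: {tokenizer.decode([top_id.item()])!r} ({top_prob.item():.4f})")
```

If both the stripped and the whitespace-prefixed label variants come out with tiny probabilities, the model is most likely heading somewhere else entirely and the prompt needs adjusting.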

Also, there is a typo in this prompt: "entails" => "entail".

I hope this is helpful.

@dkmisra
Collaborator

dkmisra commented Jan 10, 2024

Hi @BenjaminBossan. Thanks for these comments.

Regarding your first comment, we have a stacking result in our paper where we apply these interventions layer by layer in a greedy fashion. See the paragraph on *Composing reductions across layers* in the paper here.

Your second comment sounds very important and interesting. I am logging the predicted labels and log-probs, but I haven't inspected them manually. We have a few binary classification tasks: Fever, Bios Gender, TruthfulQA, and BigBench Epistemic Reasoning. I calculated the class probabilities (a sketch of the computation follows the list):

- Fever: 49.95% / 50.05%
- Bios Gender: 53.88% / 46.12%
- TruthfulQA: 56.23% / 43.77%
- Epistemic Reasoning: 37.12% / 62.88%
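
These percentages are just the relative frequencies of the gold labels, so the larger one is also the accuracy of a constant predictor. A minimal sketch of the computation (with `gold_labels` standing in for whatever the benchmark loader returns):

```python
from collections import Counter

def class_balance(gold_labels):
    """Relative frequency of each gold label."""
    counts = Counter(gold_labels)
    total = len(gold_labels)
    return {label: count / total for label, count in counts.most_common()}

# toy example: a constant predictor scores exactly the majority-class frequency
gold_labels = ["non-entailment"] * 63 + ["entailment"] * 37
print(class_balance(gold_labels))                # {'non-entailment': 0.63, 'entailment': 0.37}
print(max(class_balance(gold_labels).values()))  # 0.63, the best single-class accuracy
```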

For Bios Gender and Fever+Llama2, the gains seem to be non-trivial. For TruthfulQA and BigBench Epistemic Reasoning, all models seem to end up roughly at the best single-class prediction accuracy. LASER does help here, but maybe the corrections amount to just shifting the bias. Let me check whether your hypothesis is correct and get back later today.

Thanks for catching the typo in the prompt. I'll make a note of it.

The issue of whitespace before tokens is also important. I made changes so that GPT-J predicts the token with a leading whitespace and Llama2 doesn't, but I will do another sanity check.
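
A quick way to double-check (the checkpoint names here are the usual Hugging Face ones and only illustrative; the Llama2 one is gated):

```python
from transformers import AutoTokenizer

# compare how each tokenizer splits a label with and without a leading space;
# the first sub-token is what the evaluation ends up scoring
for name in ("EleutherAI/gpt-j-6b", "meta-llama/Llama-2-7b-hf"):
    tok = AutoTokenizer.from_pretrained(name)
    for label in ("True", " True"):
        ids = tok.encode(label, add_special_tokens=False)
        print(f"{name}  {label!r} -> {tok.convert_ids_to_tokens(ids)}")
```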

@dkmisra dkmisra added the question Further information is requested label Jan 10, 2024