Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples?
In this work, we evaluate methods that use output probabilities, internal causal-agnostic features, and internal causal features to predict the correctness of LLM outputs. We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.
*Figure: The hypothesized correspondence between internal mechanisms and generalization behaviors. In this work, we focus on the prediction direction.*
We release a dataset of five correctness prediction tasks. Given a task input and an LLM, the goal is to predict whether the LLM output is correct.
The five tasks cover symbol manipulation, knowledge retrieval, and instruction following, as shown below.
| Task Type | Internal Mechanisms | Task Names |
|---|---|---|
| Symbol manipulation | Fully known | Indirect Object Identification (IOI); PriceTag |
| Knowledge retrieval | Partially known | RAVEL; MMLU |
| Instruction following | Partially known | Unlearn Harry Potter |
Each JSON file represents one fold of a task, structured as follows:
```json
{
  "train": {
    "correct": [
      "prompt_0",
      "prompt_1",
      ...
    ],
    "wrong": [
      "prompt_0",
      "prompt_1",
      ...
    ]
  },
  "val": {
    ...
  },
  "test": {
    ...
  }
}
```

We release the prompts used in our experiment, where the "correct" and "wrong" labels are determined using Llama-3-8B-Instruct as the target model.
If you are using these tasks to predict the behavior of a different target model, you need to regenerate the correctness labels for these prompts.
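As a quick illustration, the sketch below loads one fold and prints how many prompts fall into each bucket. The file path is hypothetical; replace it with an actual fold file from the released dataset.

```python
import json

# Hypothetical path: one fold of the MMLU correctness prediction task.
FOLD_PATH = "data/mmlu/fold_0.json"

with open(FOLD_PATH) as f:
    fold = json.load(f)

# Each split maps "correct" / "wrong" to lists of prompts.
for split in ("train", "val", "test"):
    n_correct = len(fold[split]["correct"])
    n_wrong = len(fold[split]["wrong"])
    print(f"{split}: {n_correct} correct, {n_wrong} wrong prompts")
```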
We evaluate four correctness prediction methods, categorized by the type of features they use.
| Method | Feature Type | Requires Training | Requires Wrong Samples | Requires Counterfactuals | Requires Decoding |
|---|---|---|---|---|---|
| Confidence Score | Output probabilities | ✗ | ✗ | ✗ | ✓ |
| Correctness Probing | Internal causal-agnostic features | ✓ | ✓ | ✗ | Maybe |
| Counterfactual Simulation | Internal causal features | Localization only | ✗ | ✓ | ✓ |
| Value Probing | Internal causal features | ✓ | ✗ | Localization only | Maybe |
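For a concrete sense of the simplest baseline, here is a minimal sketch of a confidence score: greedily decode an answer and average the probabilities the target model assigns to its own generated tokens. This is illustrative only; the exact aggregation used in the paper may differ (e.g., minimum vs. mean token probability), and the model name and decoding length below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


@torch.no_grad()
def confidence_score(prompt: str, max_new_tokens: int = 16) -> float:
    """Greedy-decode a continuation and return the mean probability
    the model assigns to its own generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Tokens generated after the prompt, and their per-step probabilities.
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    probs = [
        torch.softmax(step_scores[0], dim=-1)[tok].item()
        for step_scores, tok in zip(out.scores, gen_tokens)
    ]
    return sum(probs) / len(probs)
```

Higher scores are treated as "more likely correct"; a threshold chosen on the validation split converts the score into a binary correctness prediction.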
We provide a demo evaluating each method on the MMLU correctness prediction task.
If you use the content of this repo, please consider citing the following work:
```bibtex
@inproceedings{
  huang2025internal,
  title={Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors},
  author={Jing Huang and Junyi Tao and Thomas Icard and Diyi Yang and Christopher Potts},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=Ofa1cspTrv}
}
```