Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples?
In this work, we evaluate methods that use output probabilities, internal causal-agnostic features, and internal causal features to predict the correctness of LLM outputs. We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.
*Figure: The hypothesized correspondence between internal mechanisms and generalization behaviors. In this work, we focus on the prediction direction.*
We release a dataset of five correctness prediction tasks. Given a task input and an LLM, the goal is to predict whether the LLM output is correct.
The five tasks cover symbol manipulation, knowledge retrieval, and instruction following, as shown below.
| Task Type | Internal Mechanisms | Task Names |
|---|---|---|
| Symbol manipulation | Fully known | Indirect Object Identification (IOI); PriceTag |
| Knowledge retrieval | Partially known | RAVEL; MMLU |
| Instruction following | Partially known | Unlearn Harry Potter |
Each JSON file represents one fold of a task, structured as follows:
```json
{
  "train": {
    "correct": [
      "prompt_0",
      "prompt_1",
      ...
    ],
    "wrong": [
      "prompt_0",
      "prompt_1",
      ...
    ]
  },
  "val": {
    ...
  },
  "test": {
    ...
  }
}
```

We release the prompts used in our experiment, where the "correct" and "wrong" labels are determined using Llama-3-8B-Instruct as the target model.
If you are using these tasks to predict the behavior of a different target model, you need to regenerate the correctness labels for these prompts.
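As a quick illustration, the sketch below loads one fold and prints how many prompts fall into each bucket. The file path is hypothetical; replace it with an actual fold file from the released dataset.

```python
import json

# Hypothetical path: one fold of the MMLU correctness prediction task.
FOLD_PATH = "data/mmlu/fold_0.json"

with open(FOLD_PATH) as f:
    fold = json.load(f)

# Each split maps "correct" / "wrong" to lists of prompts.
for split in ("train", "val", "test"):
    n_correct = len(fold[split]["correct"])
    n_wrong = len(fold[split]["wrong"])
    print(f"{split}: {n_correct} correct, {n_wrong} wrong prompts")
```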
We evaluate four correctness prediction methods, categorized by the type of features they use.
| Method | Feature Type | Requires Training | Requires Wrong Samples | Requires Counterfactuals | Requires Decoding |
|---|---|---|---|---|---|
| Confidence Score | Output probabilities | ✗ | ✗ | ✗ | ✓ |
| Correctness Probing | Internal causal-agnostic features | ✓ | ✓ | ✗ | Maybe |
| Counterfactual Simulation | Internal causal features | Localization only | ✗ | ✓ | ✓ |
| Value Probing | Internal causal features | ✓ | ✗ | Localization only | Maybe |
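For a concrete sense of the simplest baseline, here is a minimal sketch of a confidence score: greedily decode an answer and average the probabilities the target model assigns to its own generated tokens. This is illustrative only; the exact aggregation used in the paper may differ (e.g., minimum vs. mean token probability), and the model name and decoding length below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


@torch.no_grad()
def confidence_score(prompt: str, max_new_tokens: int = 16) -> float:
    """Greedy-decode a continuation and return the mean probability
    the model assigns to its own generated tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Tokens generated after the prompt, and their per-step probabilities.
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    probs = [
        torch.softmax(step_scores[0], dim=-1)[tok].item()
        for step_scores, tok in zip(out.scores, gen_tokens)
    ]
    return sum(probs) / len(probs)
```

Higher scores are treated as "more likely correct"; a threshold chosen on the validation split converts the score into a binary correctness prediction.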
We provide a demo evaluating each method on the MMLU correctness prediction task.
If you use the content of this repo, please consider citing the following work:
```bibtex
@inproceedings{
  huang2025internal,
  title={Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors},
  author={Jing Huang and Junyi Tao and Thomas Icard and Diyi Yang and Christopher Potts},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=Ofa1cspTrv}
}
```