
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples?

In this work, we evaluate methods that use output probabilities, internal causal-agnostic features, and internal causal features to predict the correctness of LLM outputs. We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.

Figure: the hypothesized correspondence between internal mechanisms and generalization behaviors. In this work, we focus on the prediction direction.

Dataset

We release a dataset of five correctness prediction tasks. Given a task input and an LLM, the goal is to predict whether the LLM output is correct.

Tasks

The five tasks cover symbol manipulation, knowledge retrieval, and instruction following, as shown below.

| Task Type | Known Internal Mechanisms | Task Names |
| --- | --- | --- |
| Symbol manipulation | Fully known | Indirect Object Identification (IOI); PriceTag |
| Knowledge retrieval | Partially known | RAVEL; MMLU |
| Instruction following | Partially known | Unlearn Harry Potter |

Data format

Each JSON file represents one fold of a task, structured as follows:

{
  "train" : {
    "correct": [
      "prompt_0",
      "prompt_1",
      ...
    ],
    "wrong": [
      "prompt_0",
      "prompt_1",
      ...
    ]
  },
  "val": {
    ...
  },
  "test": {
    ...
  }
}
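
For example, a fold can be loaded with a few lines of Python (a minimal sketch; the filename below is a placeholder for an actual fold file from this dataset):

import json

# Load one fold of a task; "fold_0.json" is a placeholder filename.
with open("fold_0.json") as f:
    fold = json.load(f)

# Each split maps "correct"/"wrong" to lists of prompt strings.
for split in ("train", "val", "test"):
    n_correct = len(fold[split]["correct"])
    n_wrong = len(fold[split]["wrong"])
    print(f"{split}: {n_correct} correct / {n_wrong} wrong prompts")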

We release the prompts used in our experiments; the "correct" and "wrong" labels are determined using Llama-3-8B-Instruct as the target model.

If you are using these tasks to predict the behaviors of a different target model, you need to regenerate the correctness labels for these prompts.
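
A minimal sketch of what relabeling might look like, assuming greedy decoding and access to a task-specific gold answer for each prompt (the released JSON stores prompts only, so the gold_answer_for lookup, generation settings, and model name below are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # substitute your target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def relabel(prompts, gold_answer_for):
    """Split prompts into "correct"/"wrong" based on the target model's greedy output."""
    labeled = {"correct": [], "wrong": []}
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        key = "correct" if gold_answer_for(prompt) in completion else "wrong"
        labeled[key].append(prompt)
    return labeled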

Methods

We evaluate four correctness prediction methods, categorized by the type of features they use.

| Method | Feature Type | Requires Training | Requires Wrong Samples | Requires Counterfactuals | Requires Decoding |
| --- | --- | --- | --- | --- | --- |
| Confidence Score | Output probabilities | | | | |
| Correctness Probing | Internal causal-agnostic features | | | | Maybe |
| Counterfactual Simulation | Internal causal features | Localization only | | | |
| Value Probing | Internal causal features | Localization only | | | Maybe |
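
As a concrete reference point, the confidence-score baseline can be sketched in a few lines for a multiple-choice task such as MMLU. This is a minimal illustration rather than the exact implementation used in our experiments: it assumes the answer options are the single tokens "A" to "D" and scores a prompt by the renormalized probability of the model's top option.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def confidence_score(prompt: str, options=("A", "B", "C", "D")) -> float:
    """Probability the model assigns to its top answer option at the next-token position."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    option_ids = [tok.encode(o, add_special_tokens=False)[0] for o in options]
    option_probs = torch.softmax(next_token_logits[option_ids], dim=-1)
    return option_probs.max().item()

# Prompts with higher scores are predicted to be answered correctly; a decision
# threshold can be tuned on the validation split of each fold.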

Demo

We provide a demo evaluating each method on the MMLU correctness prediction task.

Open In Colab

Citation

If you use the content of this repo, please consider citing the following work:

@inproceedings{huang2025internal,
  title={Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors},
  author={Jing Huang and Junyi Tao and Thomas Icard and Diyi Yang and Christopher Potts},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=Ofa1cspTrv}
}
