Welcome to the repository for our reproducibility paper submission to SIGIR ’25. The original
paper by Gao et al. (2023) can be found here.
In our work, we critically examine RARR and adapt its framework to incorporate publicly available
evidence retrieval systems and generative models, thereby operationalizing the approach. We focus
on hallucination detection, analyzing how each pipeline component contributes to this task. We conduct
a sentence-level analysis of hallucinations to provide a more granular assessment of RARR’s performance,
offering insights into RARR's strengths, limitations, and potential areas for improvement. Two key
findings are that query generation and retrieval are effective and that the agreement module is a weak
link in the RARR pipeline.
```
conda env create -f environment.yaml
```
- Query Generation
- Iterative Query Generation
- Agreement
- Query-Evidence Relevance Labeling
- Sentence-Query Quality Judgements
- Few Shot Hallucination Detection
See the `fava_datasets.ipynb` notebook.
Overview of steps:
- `<mark>…</mark>` annotations and their tagged content are removed.
- Annotations unrelated to hallucinations (e.g., `<delete>…</delete>`) are removed.
- Responses are sentence-tokenized with the spaCy Python package, and sentences with obvious annotation errors (e.g., a lone annotation forming a sentence, a leading end annotation, or a dangling start annotation) are fixed.
- Annotations that remain within a single sentence (and do not span multiple sentences) are processed and removed.
- Sentences are decontextualized as in Choi et al. (2021) by prompting an LLM (Gemini-1.5-pro-002) with each sentence and its full response.
This results in a dataset of 5,150 decontextualized sentences, each labeled with a count of hallucination types present.
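Conceptually, the annotation-removal steps reduce to a few regex passes like the sketch below. This is a rough illustration only; the exact FAVA tag set and the per-tag semantics are handled in the notebook, and the tag names here are assumptions.

```python
import re

# Hypothetical sketch of the annotation cleanup; the notebook handles
# the full FAVA tag set and its edge cases.
MARK_SPAN = re.compile(r"<mark>.*?</mark>", re.DOTALL)  # drop tags AND their content
OTHER_TAGS = re.compile(r"</?\w+>")                      # strip remaining tags, keep content

def clean_response(text: str) -> str:
    text = MARK_SPAN.sub("", text)       # remove <mark> spans entirely
    text = OTHER_TAGS.sub("", text)      # strip other annotation markers
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_response("The <mark>inserted</mark> cat <entity>sat</entity>.")` yields `"The cat sat."`.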
See the `wiki_dataset.ipynb` notebook.
Preprocessing steps:
- Documents are chunked into 512-token passages.
- Consecutive passages overlap by 32 tokens.
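The chunking scheme above can be sketched as follows. The helper name and the use of a plain token list are assumptions for illustration; the notebook's actual tokenizer and implementation may differ.

```python
def chunk_tokens(tokens, size=512, overlap=32):
    """Split a token sequence into passages of `size` tokens,
    where consecutive passages share `overlap` tokens.
    Hypothetical sketch; see wiki_dataset.ipynb for the real pipeline."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# e.g., a 1,000-token document yields passages starting at tokens 0, 480, 960
passages = chunk_tokens(list(range(1000)))
```

With `size=512` and `overlap=32`, each new passage begins 480 tokens after the previous one, so the last 32 tokens of one passage repeat as the first 32 of the next.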
Configuration Files:
- Each experiment uses a configuration file located in the `configs/` directory.
- Configuration settings are organized by RARR components:
- Query Generation (q)
- Evidence Retrieval (r)
- Agreement (a)
- The configuration filenames encode the component settings. For example, a filename like `config_q1_r1_a1.yaml` indicates that:
  - the first query generation experiment is used,
  - the first evidence retrieval experiment is used,
  - and so on.
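A hypothetical sketch of how such a file might be organized, following the q/r/a naming scheme (the actual keys are defined by the files in the repo's `configs/` directory and may differ):

```yaml
# Hypothetical structure for config_q1_r1_a1.yaml -- real keys may differ.
query_generation:     # q1: which query generation experiment to run
  experiment: 1
evidence_retrieval:   # r1: which evidence retrieval experiment to run
  experiment: 1
agreement:            # a1: which agreement experiment to run
  experiment: 1
```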
Running an Experiment:
- After validating the configuration settings, run the experiment by specifying the configuration file path and the dataset name.
- For example:
```
python run_agreement_modular.py \
  --config_file_path "./configs/config_q6_r1_a1.yaml" \
  --dataset_name "fava"
```