Welcome to the repository for our reproducibility paper submission to SIGIR ’25. The original
paper by Gao et al. (2023) can be found here.
In our work, we critically examine RARR and adapt its framework to incorporate publicly available
evidence retrieval systems and generative models, thereby operationalizing the approach. We focus
on hallucination detection, analyzing how each pipeline component contributes to this task. We conduct
a sentence-level analysis of hallucinations to provide a more granular assessment of RARR’s performance,
offering insights into RARR's strengths, limitations, and potential areas for improvement. Two key
findings are that query generation and retrieval are effective and that the agreement module is a weak
link in the RARR pipeline.
```
conda env create -f environment.yaml
```
- Query Generation
- Iterative Query Generation
- Agreement
- Query-Evidence Relevance Labeling
- Sentence-Query Quality Judgements
- Few Shot Hallucination Detection
See the `fava_datasets.ipynb` notebook.
Overview of steps:
- `<mark>…</mark>` annotations and their tagged content are removed.
- Annotations unrelated to hallucinations (e.g., `<delete>…</delete>`) are removed.
- Responses are sentence-tokenized with the spaCy Python package, and sentences with obvious annotation errors (e.g., a lone annotation forming a sentence, a leading end annotation, or a dangling start annotation) are fixed.
- Annotations that remain within a single sentence (and do not span multiple sentences) are processed and removed.
- Sentences are decontextualized as in Choi et al. (2021) by prompting an LLM (Gemini-1.5-pro-002) with each sentence and its full response.
This results in a dataset of 5,150 decontextualized sentences, each labeled with a count of hallucination types present.
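Conceptually, the annotation-removal steps reduce to a few regex passes like the sketch below. This is a rough illustration only; the exact FAVA tag set and the per-tag semantics are handled in the notebook, and the tag names here are assumptions.

```python
import re

# Hypothetical sketch of the annotation cleanup; the notebook handles
# the full FAVA tag set and its edge cases.
MARK_SPAN = re.compile(r"<mark>.*?</mark>", re.DOTALL)  # drop tags AND their content
OTHER_TAGS = re.compile(r"</?\w+>")                      # strip remaining tags, keep content

def clean_response(text: str) -> str:
    text = MARK_SPAN.sub("", text)       # remove <mark> spans entirely
    text = OTHER_TAGS.sub("", text)      # strip other annotation markers
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_response("The <mark>inserted</mark> cat <entity>sat</entity>.")` yields `"The cat sat."`.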
See the `wiki_dataset.ipynb` notebook.
Preprocessing steps:
- Documents are chunked into 512-token passages.
- Consecutive passages overlap by 32 tokens.
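The chunking scheme above can be sketched as follows. The helper name and the use of a plain token list are assumptions for illustration; the notebook's actual tokenizer and implementation may differ.

```python
def chunk_tokens(tokens, size=512, overlap=32):
    """Split a token sequence into passages of `size` tokens,
    where consecutive passages share `overlap` tokens.
    Hypothetical sketch; see wiki_dataset.ipynb for the real pipeline."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# e.g., a 1,000-token document yields passages starting at tokens 0, 480, 960
passages = chunk_tokens(list(range(1000)))
```

With `size=512` and `overlap=32`, each new passage begins 480 tokens after the previous one, so the last 32 tokens of one passage repeat as the first 32 of the next.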
Configuration Files:
- Each experiment uses a configuration file located in the `configs/` directory.
- Configuration settings are organized by RARR components:
- Query Generation (q)
- Evidence Retrieval (r)
- Agreement (a)
- The configuration filenames encode the component settings. For example, a filename like `config_q1_r1_a1.yaml` indicates that:
  - the first query generation experiment is used,
  - the first evidence retrieval experiment is used,
  - and so on.
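A hypothetical sketch of how such a file might be organized, following the q/r/a naming scheme (the actual keys are defined by the files in the repo's `configs/` directory and may differ):

```yaml
# Hypothetical structure for config_q1_r1_a1.yaml -- real keys may differ.
query_generation:     # q1: which query generation experiment to run
  experiment: 1
evidence_retrieval:   # r1: which evidence retrieval experiment to run
  experiment: 1
agreement:            # a1: which agreement experiment to run
  experiment: 1
```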
Running an Experiment:
- After validating the configuration settings, run the experiment by specifying the configuration file path and the dataset name.
- For example:
```
python run_agreement_modular.py \
  --config_file_path "./configs/config_q6_r1_a1.yaml" \
  --dataset_name "fava"
```