
dl4nlp-text-summarization

Factual Hallucination Metrics for NLG Evaluation

Table of Contents

  • Introduction
  • Methodology
  • Dataset
  • Models
  • Conclusion
  • Environment
  • Training
  • Hallucination Evaluation
  • Human judgment comparison
  • Model checkpoints
  • About

Introduction

Recent advances in Natural Language Generation (NLG) have improved the fluency and coherence of generated text in tasks such as summarization and dialogue generation. However, these models are prone to generating content that is nonsensical, unfaithful to the source, or irrelevant to the topic; such outputs are known as "hallucinations." Hallucinations, particularly factual inaccuracies, can have serious consequences such as spreading misinformation and violating privacy. To address this challenge, researchers have explored various measurement and mitigation methods. This paper provides an ensemble of metrics to measure whether generated text is factually correct.

Methodology

The methodology employed in this paper combines several metrics to detect and quantify hallucinated content. In addition to traditional metrics such as ROUGE, the following metrics are used:

  • QAGS (Question-Answering for Factual Consistency): This metric generates questions from the generated summary and evaluates factual consistency by comparing the answers obtained from the summary with those obtained from the source document.

  • BLEURT (Context-Aware Metric): BLEURT surpasses traditional metrics like BLEU and ROUGE by employing pre-trained transformers to gauge similarity between generated and reference text, capturing nuances in quality such as fluency and coherence.

  • FACT (Triple Relation-Based Metric): FACT leverages pre-trained models to extract factual triples from both the source document and the summary. Its output is a ratio of how many triples extracted from the summary are also found in the source document.

  • SUMMAC (Sentence-Level Metric): SUMMAC splits the source document and the summary into sentences and computes entailment probabilities between document and summary sentences using Natural Language Inference (NLI); a minimal sketch of this idea follows the list.
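As a rough illustration of the SUMMAC idea, the sketch below scores each summary sentence against every document sentence with an off-the-shelf MNLI model and averages the per-sentence maxima, in the style of zero-shot SummaC. This is a minimal sketch only, not the implementation used in this repository; the NLI model name, the pre-split sentences, and the max/mean aggregation are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "roberta-large-mnli"  # assumption: any MNLI-style model can be used here
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return torch.softmax(logits, dim=-1)[0, 2].item()

def summac_zs_style_score(document_sents, summary_sents) -> float:
    """Max entailment over document sentences for each summary sentence,
    averaged over summary sentences (zero-shot SummaC-style aggregation)."""
    per_sentence = [
        max(entailment_prob(doc_sent, sum_sent) for doc_sent in document_sents)
        for sum_sent in summary_sents
    ]
    return sum(per_sentence) / len(per_sentence)

document = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
summary = ["The Eiffel Tower, finished in 1889, stands in Paris."]
print(summac_zs_style_score(document, summary))

The summac package used in this repository additionally provides a learned, convolution-based aggregation variant on top of the same sentence-level entailment scores.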

Dataset

The XSum (Extreme Summarization) dataset is used for experimentation. This dataset consists of approximately 226,000 news articles from the BBC News website, each accompanied by a single-sentence summary. The summaries were written by professional editors and are considered to be high-quality references.
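For reference, here is a minimal sketch of loading XSum from the Hugging Face datasets hub. The hub identifier is an assumption; the loading path used by this repository may differ.

from datasets import load_dataset

# XSum ships with train / validation / test splits and three columns:
# "document" (the BBC article), "summary" (one-sentence reference), and "id".
xsum = load_dataset("EdinburghNLP/xsum")
example = xsum["train"][0]
print(example["document"][:200])   # start of the article text
print(example["summary"])          # single-sentence reference summary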

Models

For most of the experiments, the T5 language model is used; a short generation sketch follows the list of variants. Specific model variants include:

  • t5-small: The smallest version of the T5 model.
  • t5-small-xsum: t5-small fine-tuned on the XSum dataset.
  • t5-large: The T5 model with 770 million parameters.
  • t5-large-xsum: t5-large fine-tuned on XSum.
  • t5-large-xsum-cnn: t5-large fine-tuned on both the XSum and CNN/Daily Mail summarization datasets.
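The sketch below generates a summary with the stock t5-small checkpoint. It is an assumption about usage rather than the project's exact inference code; the fine-tuned t5-*-xsum variants above would be loaded the same way from their own checkpoint paths.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-small"  # swap in a fine-tuned checkpoint path, e.g. for t5-large-xsum
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

article = "The Bank of England has held interest rates at 5.25 percent ..."
# T5 expects a task prefix for summarization
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=60, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))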

Conclusion

This paper presents a comprehensive approach to evaluate and mitigate factual hallucinations in NLG. By utilizing an ensemble of metrics, analyzing different language models, and exploring various methods for mitigating hallucinations, the paper aims to contribute to the understanding and improvement of NLG systems.

For more details, refer to the complete paper. LINK PAPER

Environment

conda create -n DL4NLP python=3.10
conda activate DL4NLP
pip install -r requirements.yml

The hallucination metrics include factsumm and summac. See the notebook 'hallucination_metrics.ipynb' for how to install them and for the bug fix required by factsumm. To make use of GPU training, do not install factsumm with pip install factsumm; instead, clone it from the original repository. To find the factsumm package directory, open Python in a terminal:

>>> import factsumm
>>> factsumm.__file__

Training

(See train.py for more arguments)

python train.py --wandb-mode disabled
python train_without_wandb.py         # training in subepochs to see hallucination metrics over steps
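For orientation, here is a rough sketch of what such a training script typically does: fine-tune t5-small on XSum with the Hugging Face Seq2SeqTrainer, evaluating every few hundred steps so that metrics can be tracked during training. This is an assumption, not the contents of train.py; the dataset identifier, hyperparameters, and output directory are illustrative.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
xsum = load_dataset("EdinburghNLP/xsum")

def preprocess(batch):
    # prepend the T5 summarization prefix and tokenize articles and targets
    inputs = tokenizer(["summarize: " + doc for doc in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = xsum.map(preprocess, batched=True,
                     remove_columns=xsum["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum",   # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=1,
    eval_strategy="steps",        # periodic evaluation, akin to the "subepochs" above
                                  # (older transformers versions call this evaluation_strategy)
    eval_steps=500,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()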

Hallucination Evaluation

python evaluate_model_all.py 

Human judgment comparison

Compare our hallucination metrics with human judgments of hallucination on XSum, using the Google Research XSum hallucination annotations dataset:

python human_judgement.py 
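As an illustration of the kind of comparison this performs, one can rank-correlate a metric's scores with aggregated human faithfulness labels. This is a sketch only, not the contents of human_judgement.py; the scores and labels below are made up.

from scipy.stats import spearmanr

# hypothetical data: one metric score and one aggregated human label per summary
metric_scores = [0.91, 0.34, 0.77, 0.12, 0.58]
human_labels = [1, 0, 1, 0, 1]   # 1 = judged faithful, 0 = judged hallucinated

rho, p_value = spearmanr(metric_scores, human_labels)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")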

Model checkpoints

Note, however, that these models were only trained as a sanity check and have not been used in the hallucination experiments. BART-base and T5-large checkpoints can be found here: checkpoints

About

Measuring and mitigating factual hallucinations for text summarization.

Authors: Paulius, Myrthe, Erencan, Luka
