Z-Score is a Python package for evaluating disfluency removal systems on spontaneous speech. It provides:
- Word-Level Metrics (E-Scores): Precision, Recall, and F1 for disfluency removal
- Span-Level Metrics (Z-Scores): Fine-grained evaluation of how models handle specific disfluency types, EDITED, INTJ, and PRN.
This framework complements standard token-level metrics by offering interpretability and diagnostics, enabling researchers and practitioners to uncover why models succeed or fail and improve targeted modeling efforts.
This code accompanies our paper, Z-Scores: A Metric for Linguistically Assessing Disfluency Removal.
Clone this repository and install it in editable mode:
git clone https://github.com/mariateleki/zscore.git
cd zscore
pip install -e .This will install zscore and its dependencies locally, allowing you to import and use it in any Python environment.
Obtain Switchboard (Treebank-3, LDC99T42) and place it in data/treebank_3/.
Create a CSV file with two columns:
| filename | generated-text |
|---|---|
| sw2005.mrg | uh I think we should go now |
| sw3007.mrg | well yeah that sounds good |
filename: The reference treebank file ID (e.g.,sw2005.mrg)generated-text: The model-generated output to evaluate
from zscore.zscore import evaluate_file
evaluate_file("path/to/input.csv")Our package allows you to integrate evaluation directly into your pipeline.
| filename | generated-text | e_p | e_r | e_f | z_e | z_i | z_p |
|---|---|---|---|---|---|---|---|
| sw2005.mrg | uh I think we should go now | 0.85 | 0.80 | 0.82 | 0.78 | 0.05 | 0.81 |
| sw3007.mrg | well yeah that sounds good | 0.90 | 0.87 | 0.88 | 0.83 | 0.04 | 0.85 |
The figure below illustrates the Z-Score evaluation pipeline, including our deterministic alignment module (A), which enables both word-level (E-Score) and span-level (Z-Score) evaluation:
- Deterministic Alignment Module: Aligns generated outputs with disfluent transcripts for reliable scoring
- Category-Aware Metrics: Breaks down performance across disfluency types (EDITED, INTJ, PRN)
- Batch Evaluation Support: Process entire datasets in a single run
We welcome contributions! To get started:
-
Fork & Clone
git clone https://github.com/<your-username>/zscore.git cd zscore pip install -e .
This installs
zscore. -
Run Tests
PYTHONPATH=src python -m unittest discover -s tests
Make sure all tests pass before opening a pull request.
-
Submitting Changes
- Open a pull request (PR) with a clear description of your changes.
- Include tests for new functionality.
- If you are adding features or modifying behavior, update this README with relevant instructions.
- The framework is designed for research use and works with Switchboard-style parse trees, but it is generalizable to other corpora.
- We include the metaprompting experiments as part of DRES.
If you use this code, please use the following citation:
@inproceedings{teleki2025zscores,
title={Z-Scores: A Metric for Linguistically Assessing Disfluency Removal},
author={Teleki, Maria and Janjur, Sai Tejas and Liu, Haoran and Grabner, Oliver and Verma, Ketan and Docog, Thomas and Dong, Xiangjue and Shi, Lingfeng and Wang, Cong and Birkelbach, Stephanie and Kim, Jason and Zhang, Yin and Caverlee, James},
year={2025},
}