Exploiting Wild Diacritics Evaluation

The scripts provided here assume a Unix environment (Linux, macOS, etc.) and have been tested using Python 3.10 on macOS 13.6.7 running on a 2020 Intel MacBook Pro.


You'll need to have Python 3.8-3.10 installed. We suggest using pyenv to manage Python installations. You'll also need to have coreutils, CMake, and Boost installed.

On Debian/Ubuntu you can install these by running:

sudo apt-get update
sudo apt-get install coreutils cmake libboost-all-dev

On macOS using Homebrew you can run:

brew install coreutils cmake boost
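
Before moving on, it can help to confirm that the interpreter and build tools are actually visible from your shell. The following is a minimal sketch and not part of the provided scripts: the function names are hypothetical, and it only checks for the cmake executable, since Boost is a library rather than a command and coreutils commands may be installed under different names on macOS.

```python
import shutil
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True if the version falls in the 3.8-3.10 range named above."""
    return (3, 8) <= tuple(version_info[:2]) <= (3, 10)


def missing_tools(tools=("cmake",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_version_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```

If the script prints nothing, the Python version is in range and cmake was found.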

Initial Setup

We need to prepare the evaluation environment. All steps in this section only need to be run once.

First we need to set up the Python virtual environments used by all the scripts in this directory.

Note: If you have already run this command and later move the parent directory, you will have to rerun it.


Then we need to prepare the files in the data directory for evaluation. This will generate a local data directory containing the prepared files. To do so, we run:


Finally, we need to unlock the morphological analyzer databases, which are built using a dataset that isn't freely available. First purchase a copy of SAMA 3.1 from the Linguistic Data Consortium, then download it from the download page. You should have a file called LDC2010L01.tgz.

We can now unlock the database files by running:

./ /path/to/LDC2010L01.tgz

Evaluating Wild2Max

To generate and evaluate predictions for the Wild2Max dev set, we run:


To generate and evaluate predictions for the Wild2Max test set, run:


The above commands generate diacritization predictions in output/predictions/wild2max and final evaluation statistics in output/eval/wild2max.

Files in output/predictions of the form *.original.tsv contain predictions for individual genres using the original implementation of CAMeL Tools and an unmodified calima-s31 morphological database. Those of the form *.extended.tsv are created using our modified version of CAMeL Tools as well as the extended version of calima-s31. This includes predictions using our CT++ ranking algorithm.

Files in output/eval contain the computed statistics that we report in our paper and follow the *.original.tsv and *.extended.tsv conventions mentioned above.

See the TSV Output Column Reference below for more information on the contents of these files.

Evaluating WikiNewsMax

To generate and evaluate predictions for WikiNewsMax, run:


The above command generates diacritization predictions in output/predictions/wikinewsmax and final evaluation statistics in output/eval/wikinews.

Generated prediction and evaluation files follow the same *.original.tsv and *.extended.tsv suffix conventions described for Wild2Max above.

For this task we produce two sets of results. Both sets produce predictions using dediacritized WikiNews text as input. The first is evaluated against the original WikiNews gold diacritizations and uses dediac_orig_gold as a file prefix. The second is evaluated against the WikiNewsMax gold and alternative gold diacritizations and uses dediac_max_gold as a file prefix.
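
To pick out one set of results programmatically, the prefix and suffix conventions can be combined into a glob pattern. This is a rough sketch: the helper name find_outputs is hypothetical, and it assumes file names start with the prefix and end in .original.tsv or .extended.tsv with nothing else guaranteed in between.

```python
from pathlib import Path


def find_outputs(directory, prefix="", variant="extended"):
    """List TSV files in `directory` that start with `prefix`
    and end in `.<variant>.tsv`, sorted by path."""
    return sorted(Path(directory).glob(f"{prefix}*.{variant}.tsv"))
```

For example, `find_outputs("output/eval/wikinews", "dediac_max_gold")` would list the extended evaluation files scored against the WikiNewsMax gold, if any exist in that directory.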

See the TSV Output Column Reference below for more information on the contents of these files.

TSV Output Column Reference

Below are reference tables for the column names used in the produced output files in output/predictions and output/eval.

Prediction Files

| Field Name | Description |
| --- | --- |
| word | the original word |
| gold_diac | the gold (full) diacritization of the word |
| gold_diac_alt | optional alternative gold (full) diacritization of the word |
| is_oov | word is out-of-vocabulary (i.e. no analyses were produced or all produced analyses are backoffs) |
| ct_noctx | predicted diacritization using original CAMeL Tools ranking and no contextual fixes |
| ct_soloctx | predicted diacritization using original CAMeL Tools ranking and solo word contextual fixes |
| ct_fullctx | predicted diacritization using original CAMeL Tools ranking and full sentence contextual fixes |
| ctpp_soloctx | predicted diacritization produced using CT++ ranking and solo word contextual fixes |
| ctpp_fullctx | predicted diacritization produced using CT++ ranking and full sentence contextual fixes |
| oracle_noctx | oracle (best possible) diacritization provided no contextual fixes |
| oracle_soloctx | oracle (best possible) diacritization provided solo word contextual fixes |
| oracle_fullctx | oracle (best possible) diacritization provided full sentence contextual fixes |
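
As an illustration of how these columns relate, the sketch below computes a word-level accuracy for one prediction column. It is not the evaluation code used for the paper. It assumes the files are tab-separated with a header row using the field names above, and that a prediction counts as correct when it exactly matches gold_diac or a non-empty gold_diac_alt. The actual scripts may normalize or compare differently, so the result may not match the numbers in output/eval.

```python
import csv


def word_accuracy(path, prediction_column):
    """Percentage of rows whose prediction equals gold_diac or gold_diac_alt.

    Assumption: exact string match; the real evaluation may differ.
    """
    total = correct = 0
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += 1
            golds = {row["gold_diac"]}
            alt = row.get("gold_diac_alt")
            if alt:  # the alternative gold is optional and may be empty
                golds.add(alt)
            if row[prediction_column] in golds:
                correct += 1
    return 100.0 * correct / total if total else 0.0
```

Calling `word_accuracy(path, "ctpp_fullctx")` on a prediction file would give a rough counterpart to the ctpp_fullctx_accuracy statistic under these assumptions.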

Evaluation Files

| Field Name | Description |
| --- | --- |
| genre | the genre being evaluated |
| num_words | total number of words in the given genre |
| oov | percentage of words that are out-of-vocabulary |
| ct_noctx_accuracy | percentage of words that have a correct ct_noctx prediction |
| ct_soloctx_accuracy | percentage of words that have a correct ct_soloctx prediction |
| ct_fullctx_accuracy | percentage of words that have a correct ct_fullctx prediction |
| ctpp_soloctx_accuracy | percentage of words that have a correct ctpp_soloctx prediction |
| ctpp_fullctx_accuracy | percentage of words that have a correct ctpp_fullctx prediction |
| oracle_noctx_accuracy | percentage of words that have a correct oracle_noctx prediction |
| oracle_soloctx_accuracy | percentage of words that have a correct oracle_soloctx prediction |
| oracle_fullctx_accuracy | percentage of words that have a correct oracle_fullctx prediction |
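
One way to inspect an evaluation file is to compare the CT++ ranking against the original CAMeL Tools ranking per genre. The sketch below is only illustrative: the function name is hypothetical, and it assumes a tab-separated file with a header row, plain numeric accuracy values (no percent sign), and the presence of both the ct_fullctx_accuracy and ctpp_fullctx_accuracy columns, which may not hold for every file.

```python
import csv


def ctpp_gain_by_genre(path):
    """Map each genre to ctpp_fullctx_accuracy minus ct_fullctx_accuracy.

    Assumption: accuracy cells parse as plain floats.
    """
    gains = {}
    with open(path, encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            ctpp = float(row["ctpp_fullctx_accuracy"])
            ct = float(row["ct_fullctx_accuracy"])
            gains[row["genre"]] = ctpp - ct
    return gains
```

A positive value means the CT++ ranking scored higher than the original ranking for that genre, in percentage points.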