Commit 0288ad4

Small fixes (#300)

* typo and formatting
* fix import

1 parent cc78653 · commit 0288ad4

2 files changed: +44 −30 lines

neuralcoref/neuralcoref.pyx

+1 −1

@@ -41,7 +41,7 @@ from thinc.v2v import Model, ReLu, Affine
 from thinc.api import chain, clone
 # from thinc.neural.util import get_array_module

-from .file_utils import NEURALCOREF_MODEL_PATH
+from file_utils import NEURALCOREF_MODEL_PATH


 ##############################
 ## DEFAULT INFERENCE VALUES ##
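
A quick sanity check for an import change like the one above is to rebuild the package in editable mode and import it. This is only a sketch, assuming the Cython extension compiles cleanly in your environment; the `pip install -e .` invocation mirrors the training notes below.

```bash
# Rebuild the package (recompiles neuralcoref.pyx) and smoke-test the import path.
pip install -e .
python -c "import neuralcoref; print(neuralcoref.__file__)"
```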

neuralcoref/train/training.md

+43 −29
@@ -3,17 +3,20 @@
 Please check our [detailed blog post](https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe) together with these short notes.

 ## Install
+
 As always, we recommend creating a clean environment (conda or virtual env) to install and train the model.

 You will need to install [pyTorch](http://pytorch.org/), the neuralcoref package with the additional training requirements, and download a language model for spaCy.
 Currently this can be done (assuming an English language model) with
-````bash
+
+```bash
 conda install pytorch -c pytorch
 pip install -r ./train/training_requirements.txt -e .
 python -m spacy download en
-````
+```
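
For the clean-environment recommendation above, a minimal sketch with conda; the environment name and Python version are arbitrary choices, not prescribed by these notes.

```bash
# Create and activate a fresh environment before installing the training requirements.
conda create -n neuralcoref-train python=3.6
conda activate neuralcoref-train
```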

 ## Get the data
+
 The following assumes you want to train on English, Arabic or Chinese.
 If you want to train on another language, see the section [train on a new language](#train-on-a-new-language) below.

@@ -24,63 +27,71 @@ and combine these skeleton files with the OntoNotes files to get the `*._conll`

 This can be done by executing the script [compile_coref_data.sh](/neuralcoref/train/conll_processing_script/compile_coref_data.sh)
 or by following these steps:
-* From the [CoNLL 2012 download site](http://conll.cemantix.org/2012/download/), download and extract:
-  * http://conll.cemantix.org/2012/download/conll-2012-train.v4.tar.gz
-  * http://conll.cemantix.org/2012/download/conll-2012-development.v4.tar.gz
-  * http://conll.cemantix.org/2012/download/test/conll-2012-test-key.tar.gz
-  * http://conll.cemantix.org/2012/download/test/conll-2012-test-official.v9.tar.gz
-  * http://conll.cemantix.org/2012/download/conll-2012-scripts.v3.tar.gz
-  * http://conll.cemantix.org/download/reference-coreference-scorers.v8.01.tar.gz
-* Move `reference-coreference-scorers` into the folder `conll-2012/` and rename to `scorer`
-* If you are using Python 3.X, you have to edit the `conll-2012/v3/scripts/skeleton2conll.py` file
-  * Change `except InvalidSexprException, e:` to `except InvalidSexprException as e:`
-  * Change all `print ` statements to `print()`
-* Create the `*._conll` text files by executing
-  * `conll-2012/v3/scripts/skeleton2conll.sh -D path_to_ontonotes_folder/data/ conll-2012` (may take a little while)
-  * This will create `*.v4_gold_conll` files in each subdirectory of the `conll-2012` `data` folder.
-* Assemble the appropriate files into one large file each for training, development and testing
-  * `my_lang` can be `english`, `arabic` or `chinese`
-  * `cat conll-2012/v4/data/train/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> train.my_lang.v4_gold_conll`
-  * `cat conll-2012/v4/data/development/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> dev.my_lang.v4_gold_conll`
-  * `cat conll-2012/v4/data/test/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> test.my_lang.v4_gold_conll`
+
+
+- From the [CoNLL 2012 download site](http://conll.cemantix.org/2012/download/), download and extract:
+  - http://conll.cemantix.org/2012/download/conll-2012-train.v4.tar.gz
+  - http://conll.cemantix.org/2012/download/conll-2012-development.v4.tar.gz
+  - http://conll.cemantix.org/2012/download/test/conll-2012-test-key.tar.gz
+  - http://conll.cemantix.org/2012/download/test/conll-2012-test-official.v9.tar.gz
+  - http://conll.cemantix.org/2012/download/conll-2012-scripts.v3.tar.gz
+  - http://conll.cemantix.org/download/reference-coreference-scorers.v8.01.tar.gz
+- Move `reference-coreference-scorers` into the folder `conll-2012/` and rename to `scorer`
+- If you are using Python 3.X, you have to edit the `conll-2012/v3/scripts/skeleton2conll.py` file
+  - Change `except InvalidSexprException, e:` to `except InvalidSexprException as e:`
+  - Change all `print` statements to `print()`
+- Create the `*._conll` text files by executing
+  - `conll-2012/v3/scripts/skeleton2conll.sh -D path_to_ontonotes_folder/data/ conll-2012` (may take a little while)
+  - This will create `*.v4_gold_conll` files in each subdirectory of the `conll-2012` `data` folder.
+- Assemble the appropriate files into one large file each for training, development and testing
+  - `my_lang` can be `english`, `arabic` or `chinese`
+  - `cat conll-2012/v4/data/train/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> train.my_lang.v4_gold_conll`
+  - `cat conll-2012/v4/data/development/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> dev.my_lang.v4_gold_conll`
+  - `cat conll-2012/v4/data/test/data/my_lang/annotations/*/*/*/*.v4_gold_conll >> test.my_lang.v4_gold_conll`
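
The download-and-extract step at the top of this list can be scripted; a minimal sketch, assuming `wget` and `tar` are available and using the URLs exactly as listed above.

```bash
# Fetch and unpack the CoNLL-2012 skeleton archives, processing scripts and scorer.
# The scorer still has to be moved into conll-2012/ and renamed to scorer, as described above.
for url in \
    http://conll.cemantix.org/2012/download/conll-2012-train.v4.tar.gz \
    http://conll.cemantix.org/2012/download/conll-2012-development.v4.tar.gz \
    http://conll.cemantix.org/2012/download/test/conll-2012-test-key.tar.gz \
    http://conll.cemantix.org/2012/download/test/conll-2012-test-official.v9.tar.gz \
    http://conll.cemantix.org/2012/download/conll-2012-scripts.v3.tar.gz \
    http://conll.cemantix.org/download/reference-coreference-scorers.v8.01.tar.gz
do
    wget "$url"
    tar -xzf "$(basename "$url")"
done
```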

 ## Prepare the data
-Once you have the set of `*.v4_gold_conll` files, move these files into separate (`train`, `test`, `dev`) subdirectories inside a new directory. You can use the already present `data` directory or create another directory anywhere you want. Now, you can prepare the training data by running
+
+Once you have the set of `*.v4_gold_conll` files, move these files into separate (`train`, `test`, `dev`) subdirectories inside a new directory. You can use the already present `data` directory or create another directory anywhere you want. Now, you can prepare the training data by running
 [conllparser.py](/neuralcoref/train/conllparser.py) on each split of the data set (`train`, `test`, `dev`) as

-````bash
+```bash
 python -m neuralcoref.train.conllparser --path ./$path_to_data_directory/train/
 python -m neuralcoref.train.conllparser --path ./$path_to_data_directory/test/
 python -m neuralcoref.train.conllparser --path ./$path_to_data_directory/dev/
-````
+```
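
The "move these files into separate subdirectories" step above could look like this for English, using the `data` directory mentioned in the text and the file names produced by the `cat` commands earlier (a sketch only).

```bash
# Lay out the assembled CoNLL files as train/test/dev splits under data/.
mkdir -p data/train data/test data/dev
mv train.english.v4_gold_conll data/train/
mv test.english.v4_gold_conll data/test/
mv dev.english.v4_gold_conll data/dev/
```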

 Conllparser will:
+
 - parse the `*._conll` files using spaCy,
 - identify predicted mentions,
 - compute the mention features (see our blog post), and
 - gather the mention features in a set of numpy arrays to be used as input for the neural net model.

 ## Train the model
+
 Once the files have been pre-processed
 (you should have a set of `*.npy` files in a sub-directory `/numpy` in each of your (`train`|`test`|`dev`) data folders),
 you can start the training process using [learn.py](/neuralcoref/train/learn.py), for example as
-````bash
+
+```bash
 python -m neuralcoref.train.learn --train ./data/train/ --eval ./data/dev/
-````
+```
+
+There are many parameters and options for the training. You can list them with the usual

-There many parameters and options for the training. You can list them with the usual
-````bash
+```bash
 python -m neuralcoref.train.learn --help
-````
+```

 You can follow the training by running [Tensorboard for pyTorch](https://github.com/lanpa/tensorboard-pytorch)
 (it requires a version of Tensorflow; any version will be fine). Run it with `tensorboard --logdir runs`.
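
A possible way to set this up; the pip package name for the TensorBoard writer is an assumption here (it has been published as `tensorboard-pytorch` and later as `tensorboardX`, depending on the version).

```bash
# Install a TensorBoard writer for pyTorch plus a TensorFlow build to serve the dashboard,
# then point TensorBoard at the run logs written during training.
pip install tensorboardX tensorflow
tensorboard --logdir runs
```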

 ## Some details on the training
+
 The model and the training are thoroughly described in our
 [very detailed blog post](https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe).
 The training process is similar to the mention-ranking training described in
 [Clark and Manning (2016)](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf), namely:
+
 - A first step of training uses a standard cross entropy loss on the mention pair labels,
 - A second step of training uses a cross entropy loss on the top pairs only, and
 - A third step of training uses a slack-rescaled ranking loss.
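
For reference, a sketch of the third-step objective as formulated in [Clark and Manning (2016)](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf); the notation is reconstructed from the paper, not taken from this repository's code.

```latex
% Slack-rescaled ranking loss over mentions m_i with candidate antecedents A(m_i),
% true antecedents T(m_i), scoring function s, and mistake-specific cost \Delta:
\sum_{i=1}^{N} \max_{a \in \mathcal{A}(m_i)}
    \Delta(a, m_i) \left( 1 + s(a, m_i) - \max_{t \in \mathcal{T}(m_i)} s(t, m_i) \right)
```
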
@@ -90,15 +101,18 @@ With the default option, the training will switch from one step to the other as
 Training the model with the default hyper-parameters reaches a test loss of about 61.2, which is lower than the mention-ranking test loss of 64.7 reported in [Clark and Manning (2016)](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf).

 Some possible explanations:
+
 - Our mention extraction function is a simple rule-based function (in [document.py](/document.py)) that was not extensively tuned on the CoNLL dataset and, as a result, only identifies about 90% of the gold mentions in the CoNLL-2012 dataset (see the evaluation at the start of the training), thereby reducing the maximum possible score. Manually tuning a mention identification module can be a lengthy process that basically involves designing a lot of heuristics to prune spurious mentions while keeping a high recall (see for example the [rule-based mention extraction used in CoreNLP](http://www.aclweb.org/anthology/D10-1048)). An alternative is to train an end-to-end identification module as used in the AllenAI coreference module, but this is a lot more complex (you have to learn a pruning function) and the focus of the neuralcoref project is to have a coreference module with a good trade-off between accuracy and simplicity/speed.
 - The hyper-parameters and the optimization procedure have not been fully tuned, and it is likely possible to find better hyper-parameters and smarter ways to optimize. One possibility is to adjust the balance between the gradients backpropagated in the single-mention and the mention-pair feedforward networks (see our [blog post](https://medium.com/huggingface/how-to-train-a-neural-coreference-model-neuralcoref-2-7bb30c1abdfe) for more details on the model architecture). Here again, we aimed for a balance between accuracy and training speed. As a result, the model trains in about 18h versus about a week for the original model of [Clark and Manning (2016)](http://cs.stanford.edu/people/kevclark/resources/clark-manning-emnlp2016-deep.pdf) and 2 days for the current state-of-the-art model of AllenAI.
 - Again, for the sake of high throughput, the parse trees output by the [standard English model](https://spacy.io/models/en#en_core_web_sm) of spaCy 2 (which we used for these tests) are slightly less accurate than the carefully tuned CoreNLP parse trees (but they are way faster to compute!) and will lead to a slightly higher percentage of wrong parsing annotations.
 - Eventually, it may also be interesting to use newer word vectors like [ELMo](https://arxiv.org/abs/1802.05365), as they were shown to increase the state-of-the-art coreference model's F1 test score by more than 3 points.

 ## Train on a new language
+
 Training on a new language is now possible. However, do not expect it to be a plug-and-play operation, as it involves finding a good annotated dataset and adapting the file-loading and mention-extraction functions to your file format and your language's syntax (parse tree).

 To bootstrap your work, I detail here the general steps you should follow:
+
 - Find a corpus with coreference annotations (as always, the bigger, the better).
 - Check that spaCy [supports your language](https://spacy.io/models/) (i.e. is able to parse it). If not, you will have to find another parser that is able to parse your language and integrate it with the project (which might involve quite large modifications to neuralcoref depending on the parser).
 - Find a set of pre-trained word vectors in your language (GloVe or others).
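
To make the spaCy check in the list above concrete, a small sketch for a hypothetical target language (French here, purely as an example; model shortcut names depend on your spaCy version).

```bash
# Download a spaCy model for the target language and confirm it produces a parse tree.
python -m spacy download fr
python -c "import spacy; nlp = spacy.load('fr'); print([(t.text, t.dep_) for t in nlp('Une phrase de test.')])"
```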
