Wrong Mention Type one-hot vectors during training due to a small bug in dataset.py

I think there is a small bug in dataset.py that affects how the Mention Type one-hot vectors of antecedent mentions are built in the pair features during training. Because the first dimension is sliced with a colon, the assignment sets entire columns to 1: every column whose index appears in the 1-D array `ant_features_raw[:, 0]`, which holds the mention type of each antecedent mention as an integer. I think the expected behaviour was to set a single bit to 1 for each row/antecedent mention, at the position given by that row's entry in the 1-D array, [as is done for the main mention](https://github.com/huggingface/neuralcoref/blob/60338df6f9b0a44a6728b442193b7c66653b0731/neuralcoref/train/dataset.py#L185).

https://github.com/huggingface/neuralcoref/blob/60338df6f9b0a44a6728b442193b7c66653b0731/neuralcoref/train/dataset.py#L230-L231

This causes a mismatch between the training features and the inference ones: in [neuralcoref.pyx](https://github.com/huggingface/neuralcoref/blob/60338df6f9b0a44a6728b442193b7c66653b0731/neuralcoref/neuralcoref.pyx#L721), the mention type is correctly encoded as a one-hot vector for each mention, and then copied into the pair features for the antecedent mentions.

This is a simple example with numpy comparing actual vs expected results:
<img width="553" alt="Screenshot 2022-03-11 at 13 09 25" src="https://user-images.githubusercontent.com/1508097/157864402-2fc75f02-4af1-4240-a143-88fc6cb21082.png">
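
In case the screenshot is not visible, here is a minimal numpy sketch of the same kind of comparison. The values are toy ones I picked for illustration (4 mention types, 3 antecedents), not taken from the repository, and `np.eye` is used only as an independent reference for what a one-hot encoding should look like.

```python
import numpy as np

# Toy values, not from the repository: 4 mention types, 3 antecedent mentions.
n_types = 4
mention_types = np.array([0, 2, 2])

# Actual behaviour: ":" selects every row, so columns 0 and 2 are set in all rows.
actual = np.zeros((len(mention_types), n_types))
actual[:, mention_types] = 1

# Expected behaviour: row i has a single 1 at column mention_types[i].
expected = np.eye(n_types)[mention_types]

print("actual:\n", actual)
print("expected:\n", expected)
```

With these inputs every row of `actual` is `[1, 0, 1, 0]`, while each row of `expected` has exactly one bit set.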



Wrong Mention Type one-hot vectors during training due to a small bug in dataset.py #340

	ant_features = np.zeros((pairs_length, SIZE_FS - SIZE_GENRE))
	ant_features[:, ant_features_raw[:, 0]] = 1
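
A possible fix for the two lines above, sketched under assumptions: the constant values and the contents of `ant_features_raw` below are hypothetical stand-ins, and I assume `pairs_length` equals the number of rows of `ant_features_raw` and that its first column holds integer mention types. The idea is to pair each row index with its own column index instead of slicing all rows.

```python
import numpy as np

# Hypothetical stand-ins for the constants and inputs used in dataset.py.
SIZE_FS = 7
SIZE_GENRE = 1
ant_features_raw = np.array([[0, 5], [2, 1], [3, 4]])  # column 0 = mention type
pairs_length = ant_features_raw.shape[0]

ant_features = np.zeros((pairs_length, SIZE_FS - SIZE_GENRE))
# Index rows and columns together so exactly one bit is set per antecedent.
ant_features[np.arange(pairs_length), ant_features_raw[:, 0]] = 1

# Sanity check: each row is now a valid one-hot vector.
assert (ant_features.sum(axis=1) == 1).all()
```

I have not tested this against the actual training pipeline, so please treat it as a suggestion rather than a verified patch.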
