Allow multiple tokens per feature data row #99

de-code · 2020-04-01T10:09:59Z

This is carried over from #90 (comment)

Since the segmentation data uses the first two tokens of each line, it would make sense to have an option to use both of them in DeLFT. Currently only the first one is used.

Potential solution:

- an option to specify the columns with the tokens (similar to the features)
- concatenate the word embeddings and other token related vectors

We would probably need to change a few places that expect a single token as input.
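To illustrate the idea, here is a minimal sketch of selecting several token columns from a data row and concatenating their vectors. It is not DeLFT code: the function names `select_tokens` and `concat_token_vectors`, the `token_indices` option, the row layout and the toy embeddings are all hypothetical and only there to show the shape of the proposal.

```python
from typing import Dict, List, Sequence


def select_tokens(row: str, token_indices: Sequence[int]) -> List[str]:
    """Split a whitespace-separated data row and pick the token columns."""
    columns = row.split()
    return [columns[i] for i in token_indices]


def concat_token_vectors(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Concatenate one embedding per token; unknown tokens map to zeros."""
    vector: List[float] = []
    for token in tokens:
        vector.extend(embeddings.get(token, [0.0] * dim))
    return vector


# Toy 2-dimensional embeddings, purely illustrative (assumption).
toy_embeddings = {"Abstract": [0.1, 0.2], "This": [0.3, 0.4]}

# Hypothetical row: two token columns followed by feature columns.
row = "Abstract This abstract BLOCKSTART LINESTART"

tokens = select_tokens(row, token_indices=[0, 1])
vector = concat_token_vectors(tokens, toy_embeddings, dim=2)
```

With two token columns the input vector per row is twice as wide as before, which is why the places that currently assume a single token would need to change.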

/cc @kermitt2 @lfoppiano

de-code · 2020-04-07T12:50:03Z

I have now implemented something in: elifesciences/sciencebeam-trainer-delft#185

I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (that has line numbers removed). There seems to be a difference of about 1 percentage point.

lfoppiano · 2020-04-10T21:44:20Z

We can think about it once the features channel is merged.

de-code · 2020-08-04T13:55:50Z

Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end), which is then tokenized within DeLFT, or not if it's only using character features. I could see a slight improvement with max chars 30, for example.
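A rough sketch of what that could look like, under stated assumptions: the helper names `append_line_feature` and `char_indices`, the padding index 0 and the unknown-character index 1 are all hypothetical and not taken from DeLFT or sciencebeam-trainer-delft. It only shows the whole line being appended as the last column and then cut or padded to a fixed number of characters.

```python
from typing import Dict, List

PAD_INDEX = 0      # assumed index used for padding
UNKNOWN_INDEX = 1  # assumed index used for characters not in the vocabulary


def append_line_feature(columns: List[str], line_text: str) -> List[str]:
    """Return a new row with the whole line added as the last column."""
    return columns + [line_text]


def char_indices(
    text: str,
    char_to_index: Dict[str, int],
    max_chars: int = 30,
) -> List[int]:
    """Map the first max_chars characters to indices and pad to max_chars."""
    indices = [char_to_index.get(ch, UNKNOWN_INDEX) for ch in text[:max_chars]]
    return indices + [PAD_INDEX] * (max_chars - len(indices))


# Hypothetical row with two token columns; the line text is added at the end.
example_row = append_line_feature(["Title", "of"], "Title of the paper")

# Toy character vocabulary, purely illustrative (assumption).
toy_vocab = {"a": 2, "b": 3}
example_indices = char_indices("abz", toy_vocab, max_chars=5)
```

If the model also used word-level inputs for that column, the line text would be tokenized instead, for example with `example_row[-1].split()`.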

Related PRs:

This was referenced Apr 2, 2020:

allow multiple token values elifesciences/sciencebeam-trainer-delft#185 (Merged)

max_sequence_length not used(?) #44 (Closed)
