Allow multiple tokens per feature data row #99
Comments
I have now implemented something in elifesciences/sciencebeam-trainer-delft#185, where I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (which has line numbers removed). There seems to be a difference of about 1 percentage point.
We can think about it once the features channel is merged.
Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end), which is then tokenized within DeLFT, or left untokenized if it is only using character features. I could see a slight improvement with max chars 30, for example. Related PRs:
This is carried over from #90 (comment)
Since the segmentation data is using the first two tokens of a line, it would make sense to have an option to be able to use that in DeLFT. Currently it would only use the first one.
Potential solution:
Probably need to change a few places that expect a single token as an input.
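To illustrate the idea, here is a minimal sketch of reading a whitespace-separated feature row with a configurable number of leading token columns. It is not DeLFT's actual reader: `FeatureRow`, `parse_feature_row`, `num_tokens` and `has_label` are hypothetical names, and the example row and label are made up.

```python
from typing import List, NamedTuple, Optional


class FeatureRow(NamedTuple):
    tokens: List[str]      # leading token column(s)
    features: List[str]    # remaining feature columns
    label: Optional[str]   # trailing label column, if present


def parse_feature_row(
    line: str, num_tokens: int = 1, has_label: bool = True
) -> FeatureRow:
    """Split one feature row into tokens, features and an optional label.

    Hypothetical helper: assumes whitespace-separated columns, with the
    first `num_tokens` columns holding tokens and the last column holding
    the label when `has_label` is true.
    """
    columns = line.split()
    min_columns = num_tokens + (1 if has_label else 0)
    if len(columns) < min_columns:
        raise ValueError(
            f"expected at least {min_columns} columns, got {len(columns)}"
        )
    label = columns[-1] if has_label else None
    end = len(columns) - 1 if has_label else len(columns)
    return FeatureRow(columns[:num_tokens], columns[num_tokens:end], label)


# Illustrative row: two tokens, three features, one label (all made up).
example_line = "Deep Learning deep LINESTART NOCAPS I-<header>"
row = parse_feature_row(example_line, num_tokens=2)
single = parse_feature_row(example_line, num_tokens=1)
```

With `num_tokens=1` the second token simply ends up as the first feature column, which roughly corresponds to the current behaviour described above. Code that expects a single token string would then need to accept `row.tokens` as a list.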
/cc @kermitt2 @lfoppiano