Start and End properties are inaccurate #30
Comments
At some point, code was added to handle pre-tokenized input to the spaCy app. It builds a token index, and the length of that index is used to calculate the end offset. With non-tokenized input the index is built token by token, and because the length of the entire index is used rather than the length of the current token, the first token gets length 1, the second gets length 2, and so on. Lines 75 to 78 in ce95ecc
The code added for tokenized input was probably never tested to confirm that it did the right thing with non-tokenized input.
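To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the app's actual code: the function names, the `index` list, and the sample text are hypothetical, and it assumes the end offset was computed as the start offset plus the length of the growing index.

```python
def offsets_buggy(tokens, starts):
    """Assumed shape of the bug: end = start + length of the index so far."""
    index = []  # hypothetical token index, grown one token at a time
    spans = []
    for tok, start in zip(tokens, starts):
        index.append(tok)
        # len(index) is 1, 2, 3, ... regardless of how long the token is
        spans.append((start, start + len(index)))
    return spans


def offsets_fixed(tokens, starts):
    """Expected behavior: end = start + length of the token itself."""
    return [(start, start + len(tok)) for tok, start in zip(tokens, starts)]


text = "Hello big world"
tokens = text.split()
starts = [0, 6, 10]  # character offsets of each token in text

print(offsets_buggy(tokens, starts))  # [(0, 1), (6, 8), (10, 13)]
print(offsets_fixed(tokens, starts))  # [(0, 5), (6, 9), (10, 15)]
```

In the buggy variant the start offsets are correct while the reported token lengths grow by one per token, which matches the symptom in this issue.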
The "end" property is fixed; the "start" property was never wrong. I will test a little more before releasing a new version (specifically with pre-tokenized input).
Well, the pretokenized parameter seems to be broken independently of what is going on here, so I will make that a separate issue.
Bug Description
When running spaCy, the "start" and "end" values for each token are inaccurate. For example:
Reproduction steps
Run spaCy on any txt or MMIF file.
I ran it on:
Expected behavior
The "end" value should be the "start" value plus the length of the token.
Log output
No response
Screenshots
No response
Additional context
No response