Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start and End properties are inaccurate #30

Closed
kla7 opened this issue Jun 6, 2024 · 3 comments · Fixed by #33
Closed

Start and End properties are inaccurate #30

kla7 opened this issue Jun 6, 2024 · 3 comments · Fixed by #33
Labels
🐛B Something isn't working

Comments

@kla7
Copy link

kla7 commented Jun 6, 2024

Bug Description

When running spacy, the start and end values for each token are inaccurate.

For example:

        {
          "@type": "http://vocab.lappsgrid.org/Token",
          "properties": {
            "start": 26444,
            "end": 31989,
            "pos": "DT",
            "lemma": "a",
            "text": "a",
            "id": "to_5546"
          }
        }

Reproduction steps

Run spacy on any txt or mmif file.

I ran it on:

  1. Sample text from wikipedia with the following output MMIF
  2. Whisper output MMIF with the following output MMIF

Expected behavior

The end value should be the start value + the length of the token.

Log output

No response

Screenshots

No response

Additional context

No response

@kla7 kla7 added the 🐛B Something isn't working label Jun 6, 2024
@clams-bot clams-bot added this to apps Jun 6, 2024
@github-project-automation github-project-automation bot moved this to Todo in apps Jun 6, 2024
@marcverhagen marcverhagen self-assigned this Jun 10, 2024
@marcverhagen
Copy link
Contributor

At some point some code was added to deal with pre-tokenized input to the spaCy app. This builds a token index and the length of that index is used to calculate the end offset. With non-tokenized input that index is build token by token, and since the length of the entire index is used the first token is length 1, the second is length 2 and so on.

app-spacy-wrapper/app.py

Lines 75 to 78 in ce95ecc

if n not in tok_idx:
a.add_property("start", tok.idx)
a.add_property("end", tok.idx + len(tok_idx))
tok_idx[n] = a.id

The code added to deal with tokenized input probably was not tested to confirm that it did the right thing with non-tokenized input.

marcverhagen added a commit that referenced this issue Jun 10, 2024
@marcverhagen
Copy link
Contributor

The "end" property is fixed, the "start" property was never wrong. Will test a little bit more before releasing a new version (specifically for pre-tokenized input).

@marcverhagen
Copy link
Contributor

Well, the pretokenized parameter seems to be broken independent of what is going on in here so I will make that a separate issue.

@marcverhagen marcverhagen mentioned this issue Jun 10, 2024
@keighrim keighrim linked a pull request Jun 10, 2024 that will close this issue
marcverhagen added a commit that referenced this issue Jun 11, 2024
Bumping Python SDK version, bug fixes and documentation updates

- Updated to clams-python 1.2.2
- Fixed token length (issue #30)
- Fixed problems with the pretokenized parameter (issue #32) 
- Various documentation fixes.
@github-project-automation github-project-automation bot moved this from Todo to Done in apps Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛B Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants