Start and End properties are inaccurate #30

kla7 · 2024-06-06T21:41:37Z

Bug Description

When running the spaCy app, the start and end values for each token are inaccurate.

For example:

        {
          "@type": "http://vocab.lappsgrid.org/Token",
          "properties": {
            "start": 26444,
            "end": 31989,
            "pos": "DT",
            "lemma": "a",
            "text": "a",
            "id": "to_5546"
          }
        }

Reproduction steps

Run the spaCy app on any text or MMIF file.

I ran it on:

Sample text from wikipedia with the following output MMIF
Whisper output MMIF with the following output MMIF

Expected behavior

The end value should be the start value + the length of the token.
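The expectation above can be checked mechanically. The following is a minimal sketch, assuming that offsets count characters the same way Python's `len` does; the helper name `span_mismatch` is hypothetical and not part of the app or of any MMIF library. The token is the one quoted in the bug description.

```python
def span_mismatch(token):
    """Return (end - start) - len(text); 0 means the span matches the token text."""
    props = token["properties"]
    return (props["end"] - props["start"]) - len(props["text"])


# Token annotation copied from the bug description above.
reported = {
    "@type": "http://vocab.lappsgrid.org/Token",
    "properties": {
        "start": 26444,
        "end": 31989,
        "pos": "DT",
        "lemma": "a",
        "text": "a",
        "id": "to_5546",
    },
}

# Same token with the end offset the report expects (start + token length).
expected = {
    "@type": "http://vocab.lappsgrid.org/Token",
    "properties": dict(reported["properties"], end=26444 + len("a")),
}

print(span_mismatch(reported))
print(span_mismatch(expected))
```

A nonzero result flags a token whose span does not match its text, so running the helper over all Token annotations in an output file would be one way to confirm the report.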

Log output

No response

Screenshots

No response

Additional context

No response

marcverhagen · 2024-06-10T16:59:46Z

At some point, code was added to handle pre-tokenized input to the spaCy app. That code builds a token index, and the length of that index is used to calculate the end offset. With non-tokenized input the index is built token by token, and because the length of the entire index is used, the first token is length 1, the second is length 2, and so on.

app-spacy-wrapper/app.py

Lines 75 to 78 in ce95ecc

    
        if n not in tok_idx:
            a.add_property("start", tok.idx)
            a.add_property("end", tok.idx + len(tok_idx))
            tok_idx[n] = a.id

The code added to deal with tokenized input probably was not tested to confirm that it did the right thing with non-tokenized input.
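To illustrate the effect described above without installing spaCy, here is a small simulation. It is a sketch under stated assumptions: `Tok`, `simple_tokens`, `buggy_spans` and `fixed_spans` are hypothetical stand-ins, the whitespace tokenizer is not spaCy's, and `fixed_spans` shows one possible correction rather than the change actually made in the repository. The buggy version follows the ordering in the quoted lines, where the index is updated after the end offset is computed, so the computed lengths here start at 0.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token: character offset plus text."""
    idx: int
    text: str


def simple_tokens(text):
    """Whitespace tokenizer that records each token's character offset."""
    toks, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        toks.append(Tok(start, word))
        pos = start + len(word)
    return toks


def buggy_spans(tokens):
    """Mimics the quoted lines: the end offset uses the size of the index."""
    tok_idx = {}
    spans = []
    for n, tok in enumerate(tokens):
        if n not in tok_idx:
            spans.append((tok.idx, tok.idx + len(tok_idx)))
            tok_idx[n] = f"to_{n + 1}"  # hypothetical id scheme
    return spans


def fixed_spans(tokens):
    """One possible fix: the end offset uses the length of the token text."""
    return [(tok.idx, tok.idx + len(tok.text)) for tok in tokens]


tokens = simple_tokens("The cat sat on a mat")
print(buggy_spans(tokens))
print(fixed_spans(tokens))
```

In the buggy output the span length grows by one with every token regardless of the token's text, which matches the pattern in the report, where a one-character token late in the document has a span thousands of characters long.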

marcverhagen · 2024-06-10T17:23:18Z

The "end" property is fixed; the "start" property was never wrong. I will test a little more before releasing a new version (specifically for pre-tokenized input).

marcverhagen · 2024-06-10T18:52:38Z

Well, the pretokenized parameter seems to be broken independently of what is going on here, so I will make that a separate issue.

Bumping Python SDK version, bug fixes and documentation updates

- Updated to clams-python 1.2.2
- Fixed token length (issue #30)
- Fixed problems with the pretokenized parameter (issue #32)
- Various documentation fixes

kla7 added the 🐛B Something isn't working label Jun 6, 2024

clams-bot added this to apps Jun 6, 2024

github-project-automation bot moved this to Todo in apps Jun 6, 2024

marcverhagen self-assigned this Jun 10, 2024

marcverhagen added a commit that referenced this issue Jun 10, 2024

Fixed token length (issue #30)

d9ceef7

marcverhagen mentioned this issue Jun 10, 2024

Develop #33

Merged

keighrim linked a pull request Jun 10, 2024 that will close this issue

Develop #33

Merged

marcverhagen closed this as completed in #33 Jun 11, 2024

github-project-automation bot moved this from Todo to Done in apps Jun 11, 2024

clams-bot unassigned marcverhagen Jun 11, 2024
