empty targets list #2

Closed
keighrim opened this issue Apr 3, 2024 · 1 comment · Fixed by #5
Labels
🐛B Something isn't working

Comments

@keighrim
Member

keighrim commented Apr 3, 2024

Bug Description

Originally reported by @wricketts


For some Sentence objects, we see that targets is an empty list. While investigating the app code, I realized this is the only spot where a sentence's targets prop is written:

app-doctr-wrapper/app.py

Lines 134 to 141 in 780430f

for word in line.words:
    if word.confidence > 0.4:
        start = text_document.text_value.find(word.value)
        end = start + len(word.value)
        token = self.Token(view.new_annotation(at_type=Uri.TOKEN), text_document, start, end)
        token_bb = create_bbox(view, word.geometry, "text", representative.id)
        create_alignment(view, token.region.id, token_bb.id)
        sentence.add_token(token)

So an empty targets list seems to mean the OCR result showed no words in that line...? In that case, I think this wrapper app should ignore such lines instead of generating empty sentences.
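A minimal sketch of the "ignore such lines" idea, not the app's actual code. `Word` and `Line` are hypothetical stand-ins for the docTR result objects (only the fields used in the snippet above), and `lines_with_tokens` is an illustrative name. It collects the words that pass the 0.4 confidence check first and only hands a line back when at least one word survives, so a caller would never create a Sentence with empty targets.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


# Hypothetical stand-ins for docTR's word/line results (assumption:
# only `value`, `confidence`, and `words` matter for this sketch).
@dataclass
class Word:
    value: str
    confidence: float


@dataclass
class Line:
    words: List[Word]


def lines_with_tokens(
    lines: List[Line], min_confidence: float = 0.4
) -> Iterator[Tuple[Line, List[Word]]]:
    """Yield (line, kept_words) only for lines that keep at least one word."""
    for line in lines:
        kept = [w for w in line.words if w.confidence > min_confidence]
        if not kept:
            # No usable words: skip the line rather than emit an empty sentence.
            continue
        yield line, kept
```

With this shape, the Sentence annotation would be created inside the loop over `lines_with_tokens(...)`, after the emptiness check rather than before it.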

Reproduction steps

(screenshot from @wricketts's report)

screenshot_2024-03-30_at_9 07 54___pm

Expected behavior

No response

Log output

No response

Screenshots

No response

Additional context

No response

@keighrim keighrim added the 🐛B Something isn't working label Apr 3, 2024
@clams-bot clams-bot added this to apps Apr 3, 2024
@github-project-automation github-project-automation bot moved this to Todo in apps Apr 3, 2024
@keighrim
Member Author

Based on the newly documented input spec of the RFB (downstream) app (clamsproject/app-role-filler-binder#3), here are some additional implementation plans:

  1. stop filtering out low-confidence recognition results, and keep all tokens from the model
  2. the @text value should reflect the structure, even when that means redundant information is available in the Sentence and Paragraph annotations
    • we actually talked about removing Sentence and Paragraph annotations to reduce the redundancy, but I realized that is not a good idea, since it probably means we would also remove block- and line-level bounding boxes, and that's an actual loss of information.
    • so to do this, it's probably easier to build the @text values in a bottom-up way, instead of using the render() call available in the docTR library
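The bottom-up idea could be sketched roughly as below. This is an illustration under stated assumptions, not the planned implementation: the OCR result is reduced to plain nested lists (blocks of lines of word strings), `build_text` is a hypothetical name, and the separators (space between words, newline between lines, blank line between blocks) are assumed rather than taken from the RFB input spec. Because the text is assembled word by word, each token, line, and block gets its character span from a running offset instead of a search over the finished text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_text(
    blocks: List[List[List[str]]],
) -> Tuple[str, List[Span], List[Span], List[Span]]:
    """Assemble text bottom-up and record spans at every level.

    `blocks` is a hypothetical simplification: blocks -> lines -> word strings.
    Returns (text, token_spans, line_spans, block_spans).
    """
    parts: List[str] = []
    pos = 0  # running offset into the text built so far
    tokens: List[Span] = []
    lines: List[Span] = []
    paragraphs: List[Span] = []
    for b_idx, block in enumerate(blocks):
        if b_idx > 0:
            parts.append("\n\n")  # assumed block separator
            pos += 2
        block_start = pos
        for l_idx, line in enumerate(block):
            if l_idx > 0:
                parts.append("\n")  # assumed line separator
                pos += 1
            line_start = pos
            for w_idx, word in enumerate(line):
                if w_idx > 0:
                    parts.append(" ")  # assumed word separator
                    pos += 1
                tokens.append((pos, pos + len(word)))
                parts.append(word)
                pos += len(word)
            lines.append((line_start, pos))
        paragraphs.append((block_start, pos))
    return "".join(parts), tokens, lines, paragraphs
```

In this sketch the line spans would back the Sentence annotations and the block spans the Paragraph annotations, so the redundancy mentioned above stays consistent with the token offsets by construction.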
