Process ocr_caption lines #1466

0dinD · 2025-01-30T16:57:59Z

Previously, the hOCR transform ignored ocr_caption spans. This PR fixes that, so OCRmyPDF now processes image caption lines instead of dropping them. As far as I know, ocr_caption was the only line type Tesseract can produce that was still missing; see: https://github.com/tesseract-ocr/tesseract/blob/3157ff0e741ea5c85e16fbd1c6edf20f30eccbd3/src/api/hocrrenderer.cpp#L231-L248
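To illustrate the idea, here is a minimal, self-contained sketch of selecting line-level spans from hOCR by class. It is not OCRmyPDF's actual code: the `LINE_CLASSES` set, the `iter_text_lines` helper, and the sample markup are all hypothetical, and the list of class names is my assumption about what the linked Tesseract renderer emits.

```python
import xml.etree.ElementTree as ET

# Assumed set of line-level hOCR classes; ocr_caption is the one this PR adds.
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_textfloat", "ocr_caption"}

# Hypothetical, simplified hOCR fragment (real output carries more attributes).
SAMPLE_HOCR = """<div class="ocr_page">
  <span class="ocr_line" title="bbox 10 10 200 30"><span class="ocrx_word">Body</span></span>
  <span class="ocr_caption" title="bbox 10 40 200 60"><span class="ocrx_word">Figure</span> <span class="ocrx_word">1</span></span>
</div>"""


def iter_text_lines(root):
    """Yield (line_class, text) for each span whose class is a known line type."""
    for span in root.iter("span"):
        matched = set(span.get("class", "").split()) & LINE_CLASSES
        if matched:
            # Collapse whitespace between the word spans into single spaces.
            text = " ".join("".join(span.itertext()).split())
            yield matched.pop(), text


lines = list(iter_text_lines(ET.fromstring(SAMPLE_HOCR)))
print(lines)
```

If `"ocr_caption"` were left out of `LINE_CLASSES`, the second span would be skipped silently, which is the kind of omission described above.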

codecov · 2025-02-02T00:34:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.16%. Comparing base (137b054) to head (b7d63f3).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1466   +/-   ##
=======================================
  Coverage   90.16%   90.16%           
=======================================
  Files          95       95           
  Lines        7128     7128           
  Branches      729      729           
=======================================
  Hits         6427     6427           
  Misses        496      496           
  Partials      205      205


Process ocr_caption lines

b7d63f3
