Text extraction from a Image using OCR stored in PDF #4414

Prasaderp · 2025-03-28T09:41:04Z

I have a PDF whose title is stored as an image, alongside normal text. How can I extract both the normal text and the text inside the image (via OCR) together using PyMuPDF? I am able to extract the normal text from the PDF, but I also want to extract the image text from the same PDF, which is actually the title of the PDF.

See the image below: I can extract all the normal text, which I have selected using Ctrl+A, but some text inside the images (e.g. "Attention to", "Farm") cannot be extracted. How can I achieve that too?

JorjMcKie · 2025-03-28T13:10:30Z

Maintainer

PyMuPDF supports "partial OCR":

accepts "normal" / standard PDF text
adds text found in images

This is done via get_textpage_ocr, which results in a joint "corpus" of all text on the page. The extraction sequence of this is, however:

standard text
OCR-ed text

So you need to sort by geometrical information when required. A good first approximation can be achieved with this snippet:

# full=False requests partial OCR: existing text is kept, only images are OCR-ed
textpage = page.get_textpage_ocr(dpi=150, full=False)
blocks = page.get_text("blocks", textpage=textpage, sort=True)
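
For anyone who wants to see this end to end, here is a minimal sketch of the same idea applied to every page of a file. It assumes a recent PyMuPDF that can be imported as pymupdf (older versions use import fitz) and a working Tesseract installation for the OCR step. The names blocks_in_reading_order and extract_page_texts are hypothetical helpers made up for this illustration, not part of PyMuPDF, and the sort is a simple top-to-bottom, left-to-right ordering of my own rather than what sort=True does internally.

```python
def blocks_in_reading_order(blocks):
    """Return the text of the given blocks, ordered top-to-bottom, left-to-right.

    Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type),
    which is the layout produced by page.get_text("blocks").
    Hypothetical helper for illustration only.
    """
    # Keep text blocks only (block_type 0); image blocks have block_type 1.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort by the top edge (rounded to whole points), then by the left edge.
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in text_blocks]


def extract_page_texts(pdf_path, dpi=150):
    """Return one list of text blocks per page, standard and OCR-ed text merged.

    Hypothetical helper; requires PyMuPDF and Tesseract at call time.
    """
    import pymupdf  # imported here so the helper above works without PyMuPDF

    results = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # full=False: keep the existing text layer and OCR only the images.
            textpage = page.get_textpage_ocr(dpi=dpi, full=False)
            blocks = page.get_text("blocks", textpage=textpage)
            results.append(blocks_in_reading_order(blocks))
    return results
```

Calling extract_page_texts("input.pdf") (the file name is a placeholder) should then give, for each page, the image text such as the title in its geometric position among the normal text blocks. A higher dpi may help with small text inside images, at the cost of speed.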

Prasaderp Mar 28, 2025
Author

Thanks! This helps a lot.
