Text extraction from an Image using OCR stored in PDF #4414
Unanswered
Prasaderp asked this question in Looking for help
-
PyMuPDF supports "partial OCR":
... get_textpage_ocr, which results in a joint "corpus" of all text on the page. The extraction sequence of this corpus, however, does not necessarily follow the reading order.
So you need to sort by geometrical information when required. A good first approximation can be achieved by this snippet:
textpage = page.get_textpage_ocr(dpi=150, full=False, ...)
blocks = page.get_text("blocks", textpage=textpage, sort=True)
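For anyone landing here later, a fuller sketch of the same idea might look like the following. It is only an illustration under some assumptions: Tesseract is installed and discoverable by PyMuPDF, the package is importable as pymupdf (older releases use import fitz instead), and the helper names, the file name and the row_tolerance value are made up for this example. The extra sort_blocks helper is not part of PyMuPDF; it is one possible way to apply the "sorting by geometrical information" mentioned above when sort=True alone is not enough.

```python
def sort_blocks(blocks, row_tolerance=3.0):
    """Order block tuples top-to-bottom, then left-to-right.

    Each block is expected in the shape returned by
    page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).
    Blocks whose top edge lies within row_tolerance points of the first
    block of the current row are treated as the same row.
    row_tolerance is a hypothetical tuning value, not a PyMuPDF setting.
    """
    rows = []
    for block in sorted(blocks, key=lambda b: b[1]):
        if rows and abs(block[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def extract_text_with_partial_ocr(pdf_path, dpi=150):
    """Return the text of every page, OCR-ing only non-text areas.

    Requires PyMuPDF plus a working Tesseract installation, so the
    import is done here rather than at module level.
    """
    import pymupdf

    pages = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # full=False requests "partial OCR": normal text is taken
            # as-is and only images on the page are sent to Tesseract.
            textpage = page.get_textpage_ocr(dpi=dpi, full=False)
            blocks = page.get_text("blocks", textpage=textpage, sort=True)
            ordered = sort_blocks(blocks)
            pages.append("\n".join(b[4].strip() for b in ordered))
    return pages


if __name__ == "__main__":
    # "input.pdf" is a placeholder file name.
    for number, text in enumerate(extract_text_with_partial_ocr("input.pdf"), 1):
        print(f"--- page {number} ---")
        print(text)
```

A higher dpi usually gives Tesseract more pixels to work with at the cost of speed, so it may be worth experimenting with that value for small title images.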
-
So, I have a PDF whose title is stored as an image, alongside normal text. How can I extract both the normal text and the OCR text from the image together using PyMuPDF? I am able to extract the normal text from the PDF, but I also want to extract the text in the image, which is actually the title of the PDF.
See in the image below: I can extract all the normal text, which I have selected using Ctrl+A, but you can see that some text inside the images (e.g. "Attention to", "Farm") cannot be extracted. How can I achieve that too?