Efficiently Translating and Reconstructing Large PDFs While Preserving Layout Using PyMuPDF #4395

Prasaderp · 2025-03-21T10:10:42Z

Prasaderp
Mar 21, 2025

I am working on a project where I extract text blocks from a PDF, translate them using an LLM, and place them back into the PDF block by block to preserve the structure of the original. But this approach does not seem to work well for large PDFs. Is there any way to solve this using PyMuPDF? If yes, your help would be really appreciated!

Thank you for help!

JorjMcKie · 2025-03-21T10:54:15Z

JorjMcKie
Mar 21, 2025
Maintainer

You do not mention what specific problems you are encountering. So it is hard to give you recommendations.

0 replies

Prasaderp · 2025-03-21T11:03:41Z

Prasaderp
Mar 21, 2025
Author

I am using the block structure of the PDF: I extract the text, translate it using my LLM model, and then insert it back into the PDF (at the proper position, using the X and Y coordinates) via the block structure provided by PyMuPDF. This is fast and efficient (about 5 sec) for PDFs with a small number of pages. But when I try this with PDFs of 50-60 pages, the time increases exponentially because of the process I use for extracting text blocks and inserting them back, which is highly time consuming. Is there any way to do this efficiently for PDFs of any page count? Also, in my output PDF I want all elements to be preserved and placed at the same positions as in the original; only the text should be replaced by its translation. Thank you again!

```python
import fitz  # PyMuPDF
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re
import time
```

1 reply

JorjMcKie Mar 21, 2025
Maintainer

Please do not share your code like this! It is extremely hard to read and impossible to otherwise work with.
Please use GitHub's code blocks!

JorjMcKie · 2025-03-21T12:04:25Z

JorjMcKie
Mar 21, 2025
Maintainer

I am not sure that "exponential" is the right word for the increase in processing time.
But page.insert_htmlbox() is definitely a feature-rich method and thus has high resource requirements.
Whenever you insert text into a rectangle:

1. An HTML parser is invoked that reads your text, determines the fonts required to render it, and invokes the text shaper to create the output stream for insertion into the PDF.
2. The space required for the result produced under 1. is computed. If it exceeds the provided rectangle, the rectangle is (virtually) enlarged by some scaling factor and the process loops back to point 1. Otherwise ...
3. The result is written into the original rectangle, with all content scaled by the previously computed scaling factor.

I hope you can imagine that the level of comfort you desire comes with an appropriate level of resource consumption.
The above algorithm is executed multiple times per page - i.e. for every text block.
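
For illustration, the calling side of this per-block work might look like the sketch below. It is not the code from this thread and it does not show the internals of insert_htmlbox. The helper names `to_html` and `replace_blocks` and the `translate` callable are assumptions made for the example, and the `(spare_height, scale)` return value assumes a recent PyMuPDF version.

```python
import html


def to_html(text):
    """Escape translated plain text and keep its line breaks for HTML insertion."""
    return html.escape(text).replace("\n", "<br>")


def replace_blocks(page, translate):
    """Replace each text block on `page` with its translation (hypothetical helper).

    `translate` is any callable mapping a source string to a translated string.
    Returns the scale factor that insert_htmlbox reported for every block.
    """
    import fitz  # PyMuPDF, imported here so to_html() stays usable without it

    # Block tuples are (x0, y0, x1, y1, text, block_no, block_type); type 0 is text.
    blocks = [b for b in page.get_text("blocks") if b[6] == 0 and b[4].strip()]

    # Remove the original text first, leaving images untouched.
    for b in blocks:
        page.add_redact_annot(fitz.Rect(b[:4]))
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

    scales = []
    for b in blocks:
        # One parse / measure / scale pass runs inside every one of these calls.
        spare_height, scale = page.insert_htmlbox(
            fitz.Rect(b[:4]), to_html(translate(b[4]))
        )
        scales.append(scale)
    return scales
```

A scale below 1 means the translated text had to be shrunk to fit its original rectangle.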

So my recommendation is to split large documents into page-range segments of 5 to 10 pages each. Then invoke your script separately for each of these subdocuments using Python's multiprocessing module. Once all segments have been translated, join them again into one document.
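
A minimal sketch of this split / process / join approach, under stated assumptions: `page_segments`, `translate_segment`, `translate_document` and the stub `translate_page` are hypothetical names, the temporary file naming is arbitrary, and the default segment size of 10 simply follows the range suggested above.

```python
from multiprocessing import Pool


def page_segments(page_count, segment_size=10):
    """Return inclusive (first, last) page index pairs covering the whole document."""
    return [
        (start, min(start + segment_size, page_count) - 1)
        for start in range(0, page_count, segment_size)
    ]


def translate_page(page):
    """Hypothetical placeholder: put your existing per-page translation logic here."""


def translate_segment(job):
    """Worker: copy one page range into its own PDF, translate it, save it."""
    import fitz  # PyMuPDF, loaded separately by each worker process

    src_path, first, last = job
    with fitz.open(src_path) as src:
        part = fitz.open()  # new, empty PDF
        part.insert_pdf(src, from_page=first, to_page=last)
    for page in part:
        translate_page(page)
    part_path = f"{src_path}.part{first:04d}.pdf"  # arbitrary temp file name
    part.save(part_path)
    part.close()
    return part_path


def translate_document(src_path, out_path, segment_size=10, processes=None):
    """Translate the segments in parallel, then join them in page order."""
    import fitz  # PyMuPDF

    with fitz.open(src_path) as doc:
        jobs = [(src_path, a, b) for a, b in page_segments(doc.page_count, segment_size)]
    with Pool(processes) as pool:
        part_paths = pool.map(translate_segment, jobs)  # map() preserves job order
    result = fitz.open()
    for path in part_paths:
        with fitz.open(path) as part:
            result.insert_pdf(part)
    result.save(out_path, garbage=3, deflate=True)
    result.close()
```

Call `translate_document("input.pdf", "output.pdf")` from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn worker processes. With `processes=None` the pool uses one worker per CPU core; on a machine with few resources, a smaller value limits memory use at the cost of speed.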

0 replies

Prasaderp · 2025-03-21T12:12:37Z

Prasaderp
Mar 21, 2025
Author

Thank you for the response, @JorjMcKie. So how do websites like this one (https://www.onlinedoctranslator.com/en/translationform) translate PDFs almost instantly? If you have any idea, please help me out; I am looking to do something similar.

4 replies

JorjMcKie Mar 21, 2025
Maintainer

For sure they also use advanced parallel processing techniques, for both the language translation part and the text shaping part.
If you want to do a similar job, then you must use parallel processing too.

Prasaderp Mar 21, 2025
Author

Also, do you think there is any other way to achieve what I am trying to do, instead of inserting each text block back into the PDF after translating?

JorjMcKie Mar 21, 2025
Maintainer

If your output language is one that requires text shaping (a MUST DO for many Asian scripts), then you must use insert_htmlbox.
You must also use logic that is capable of dynamically scaling down the translated text, because translated text may need more space.
This is the comfort you requested!
If you could write a completely new PDF, without having to care about "maintaining layout", things can get much easier and faster.

But why do you resist using multiprocessing?

Prasaderp Mar 22, 2025
Author

As you can see in my code, I am already using insert_htmlbox, which produces perfect output. I am just looking for ways to do this task more efficiently for large PDFs (approx. 100 pages). Also, for now I am running my project on a local computer, which does not have many resources to offer for multiprocessing.

So if there is a solution other than multiprocessing for translating text in PDFs, let me know!
Thanks
