Efficiently Translating and Reconstructing Large PDFs While Preserving Layout Using PyMuPDF #4395
Replies: 4 comments 5 replies
-
You do not mention what specific problems you are encountering, so it is hard to give you recommendations.
-
I am using the block structure of the PDF: I extract the text of each block, translate it with my LLM, and then insert the translation back into the PDF at the block's original position (using its x and y coordinates), based on the block structure provided by PyMuPDF. This is fast and efficient (5 sec) for PDFs with a small number of pages. But when I apply it to PDFs of 50-60 pages, the time increases exponentially, because extracting the text blocks and inserting them back is very time-consuming. Is there a way to do this efficiently for PDFs with any number of pages?
Also, in the output PDF I want all other elements to be preserved at the same positions as in the original; only the text should be replaced by its translation.
Thank you again!
```python
import fitz # PyMuPDF
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re
import time
```
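For illustration, here is a minimal sketch of the extract / translate / re-insert loop described above. It is not the poster's actual script: the names `text_blocks`, `translate_page`, `translate_pdf` and the `translate_batch` callable are hypothetical, and removing the old text with redaction annotations and writing the new text with `insert_htmlbox` is an assumption about one workable approach. The main point is that the model is called once per page with all block texts, not once per block.

```python
import html


def text_blocks(page):
    """Return (rect, text) pairs for the non-empty text blocks of one page."""
    found = []
    # Each "blocks" entry is (x0, y0, x1, y1, text, block_no, block_type).
    for x0, y0, x1, y1, text, _no, block_type in page.get_text("blocks"):
        if block_type == 0 and text.strip():  # 0 = text block, 1 = image block
            found.append(((x0, y0, x1, y1), text.strip()))
    return found


def translate_page(page, translate_batch):
    """Replace the text of one page with its translation, in place.

    `translate_batch` is a hypothetical callable: list[str] -> list[str].
    Returns the number of blocks that were replaced.
    """
    blocks = text_blocks(page)
    if not blocks:
        return 0
    # One model call per page instead of one call per block.
    translated = translate_batch([text for _rect, text in blocks])
    for rect, _text in blocks:
        page.add_redact_annot(rect)  # mark the original text for removal
    # images=0 (PDF_REDACT_IMAGE_NONE) leaves overlapping images untouched.
    page.apply_redactions(images=0)
    for (rect, _text), new_text in zip(blocks, translated):
        # insert_htmlbox scales the text down if needed to fit the rectangle.
        page.insert_htmlbox(rect, html.escape(new_text))
    return len(blocks)


def translate_pdf(src, dst, translate_batch):
    """Translate every page of `src` and write the result to `dst`."""
    import fitz  # PyMuPDF; imported here so the helpers above run without it

    doc = fitz.open(src)
    for page in doc:
        translate_page(page, translate_batch)
    doc.save(dst, garbage=3, deflate=True)
    doc.close()
```

A call might look like `translate_pdf("input.pdf", "output.pdf", my_translate_batch)`, where `my_translate_batch` wraps the tokenizer and model so that a whole list of strings is translated in a single batched `generate` call. Whether vector graphics that overlap a text block survive the redaction depends on the PyMuPDF version and its redaction options, so that part should be checked against the documentation for the installed version.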
-
I am not sure that "exponential" is the right word for the increase in processing time.
I hope you can imagine that the level of comfort you desire comes with an appropriate level of resource consumption. So my recommendation is to split large documents into page-range segments of 5 to 10 pages each, then invoke your script separately for each of these subdocuments using Python's
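As a rough sketch of the segmenting idea: the code below splits a document into page ranges, processes each range in its own worker process, and merges the parts in page order. Using `concurrent.futures.ProcessPoolExecutor` as the dispatch mechanism is my assumption for illustration, and `page_segments`, `process_segment` and `process_in_segments` are hypothetical names; the actual translation step is left as a placeholder comment.

```python
from concurrent.futures import ProcessPoolExecutor


def page_segments(page_count, size=10):
    """Split pages 0..page_count-1 into inclusive (first, last) ranges."""
    return [(start, min(start + size, page_count) - 1)
            for start in range(0, page_count, size)]


def process_segment(job):
    """Worker: copy one page range into its own file and process it there."""
    import fitz  # PyMuPDF, imported inside the worker process

    src, first, last = job
    part_path = f"{src}.part-{first:04d}-{last:04d}.pdf"
    source = fitz.open(src)
    part = fitz.open()  # new, empty PDF
    part.insert_pdf(source, from_page=first, to_page=last)
    # ... translate the pages of `part` here (placeholder) ...
    part.save(part_path, garbage=3, deflate=True)
    part.close()
    source.close()
    return part_path


def process_in_segments(src, dst, size=10, workers=4):
    """Process `src` in page-range segments and merge the results into `dst`."""
    import fitz  # PyMuPDF

    with fitz.open(src) as doc:
        jobs = [(src, first, last)
                for first, last in page_segments(doc.page_count, size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        part_paths = list(pool.map(process_segment, jobs))  # keeps job order
    merged = fitz.open()
    for path in part_paths:
        with fitz.open(path) as part:
            merged.insert_pdf(part)
    merged.save(dst, garbage=3, deflate=True)
    merged.close()
```

On platforms that start worker processes with "spawn" (Windows, macOS), the call to `process_in_segments(...)` needs to sit under an `if __name__ == "__main__":` guard. If the translation model runs on a single GPU, several workers will compete for it, so the best values for `size` and `workers` have to be found by measurement.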
-
Thank you for the response, @JorjMcKie. How do websites such as https://www.onlinedoctranslator.com/en/translationform manage to translate PDFs almost instantly? If you have any idea, please help me out here; I am looking to build something similar.
-
I am currently working on a project where I extract text blocks from a PDF, translate them using an LLM, and place them back in the PDF block by block to preserve the structure of the original. But this approach does not seem to work well for large PDFs. Is there any way to solve this using PyMuPDF? If so, your help would be really appreciated!
Thank you for your help!