Dealing with Embedded Files
As of v1.11.0 (based on MuPDF v1.11), PyMuPDF can deal with embedded files.
This feature (PDF 1.4 format) allows attaching arbitrary data or files to PDF documents. With PyMuPDF, such embedded data can be added, deleted, extracted and modified.
We have included some example scripts in the respective directory that demonstrate the use of this feature.
Here we show some interactive sessions:
>>> import fitz
>>> doc = fitz.open("test.pdf")
>>> doc.embeddedFileCount # show the number of embedded files
7
>>> for i in range(doc.embeddedFileCount): # display info about them
...     print(doc.embeddedFileInfo(i))
{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'mit Latin in der Beschreibung, S†áe!', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
>>>
>>> # change the description of one entry
>>> doc.embeddedFileSetInfo("mit Latin", None, "new description without problematic characters")
0
>>> for i in range(doc.embeddedFileCount): # show what happened
...     print(doc.embeddedFileInfo(i))
{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'new description without problematic characters', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
>>>
>>> # a new entry can be created from arbitrary data (bytes or bytearray)
>>> doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
1
>>> for i in range(doc.embeddedFileCount): # again show the result
...     print(doc.embeddedFileInfo(i))
{'name': 'pdftest', 'file': 'pdftest', 'desc': 'pdftest', 'size': 609, 'length': 609}
{'name': 'umlaute?', 'file': 't-ink.pdf', 'desc': 'können wir Ùmláútê?', 'size': 2389, 'length': 2389}
{'name': 'testann.py', 'file': 'testann.py', 'desc': 'Beschreibung', 'size': 1222, 'length': 1222}
{'name': 'minpdf.py', 'file': 'minpdf.py', 'desc': 'minpdf.py', 'size': 1693, 'length': 1693}
{'name': 'mit Latin', 'file': 'latin.log', 'desc': 'new description without problematic characters', 'size': 40, 'length': 40}
{'name': 'test1.pdf', 'file': 'test1.pdf', 'desc': 'test1.pdf', 'size': 65917, 'length': 65917}
{'name': 'ink-demo', 'file': 't-ink.pdf', 'desc': 'Test neues FileAdd', 'size': 2389, 'length': 2389}
{'name': 'new data', 'file': 'new data', 'desc': 'we do not need files for this', 'size': 19, 'length': 19}
>>>
>>> # new names must be unique:
>>> doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    doc.embeddedFileAdd(b"some arbitrary data", "new data", None, "we do not need files for this")
  File "C:\Users\Jorj\AppData\Local\Programs\Python\Python36\lib\site-packages\fitz\fitz.py", line 335, in embeddedFileAdd
    return _fitz.Document_embeddedFileAdd(self, buffer, name, filename, desc)
Exception: Name already exists in embedded files
>>>
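The listing loop and the unique-name rule shown in this session can be wrapped in two small helpers. This is only a sketch: it assumes `doc` is an open document exposing `embeddedFileCount`, `embeddedFileInfo` and `embeddedFileAdd` exactly as used above, and the helper names `embedded_names` and `add_if_absent` are hypothetical, not part of PyMuPDF.

```python
def embedded_names(doc):
    """Return the names of all embedded files in doc, in index order."""
    return [doc.embeddedFileInfo(i)["name"] for i in range(doc.embeddedFileCount)]


def add_if_absent(doc, data, name, desc):
    """Embed data under name unless that name is already taken.

    Returns True if the entry was added, False if the name already existed.
    (Hypothetical helper; it avoids the "Name already exists" exception.)
    """
    if name in embedded_names(doc):
        return False
    # Same positional order as in the session: buffer, name, filename, desc.
    doc.embeddedFileAdd(data, name, None, desc)
    return True
```

Checking the name first lets a script skip duplicates instead of handling the exception afterwards.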
If an entry is of a supported file type (a PDF, for example), we can extract it and open it directly from memory:
>>> stream=doc.embeddedFileGet("test1.pdf")
>>> len(stream)
65917
>>> doc2 = fitz.open("pdf", stream)
>>> doc2.pageCount
1
>>>
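Extraction generalizes to all entries of a document. The sketch below collects the content of every embedded file into a dictionary keyed by entry name. It assumes `embeddedFileGet` accepts an entry name and returns a bytes-like object, as in the session above; `extract_all` is a hypothetical helper name.

```python
def extract_all(doc):
    """Return {entry name: content as bytes} for every embedded file in doc."""
    contents = {}
    for i in range(doc.embeddedFileCount):
        name = doc.embeddedFileInfo(i)["name"]
        # Entry names are unique, so they are safe to use as dictionary keys.
        contents[name] = bytes(doc.embeddedFileGet(name))
    return contents
```

Any value that holds a supported document type can then be opened from memory in the same way as `stream` above, for example with `fitz.open("pdf", contents["test1.pdf"])`.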