We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When extracting text from pdf (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warning and the extraction failed.
My code is as: import os import sys import importlib importlib.reload(sys) from pdfminer.pdfparser import PDFParser,PDFDocument from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import PDFPageAggregator from pdfminer.layout import LTTextBoxHorizontal,LAParams from pdfminer.pdfinterp import PDFTextExtractionNotAllowed def parse(path,target): if (os.path.exists(target)): os.remove(target) fp = open(path, 'rb') praser = PDFParser(fp) doc = PDFDocument() praser.set_document(doc) doc.set_parser(praser)
doc.initialize() if not doc.is_extractable: raise PDFTextExtractionNotAllowed else: rsrcmgr = PDFResourceManager() laparams = LAParams(all_texts = True) device = PDFPageAggregator(rsrcmgr, laparams=laparams) interpreter = PDFPageInterpreter(rsrcmgr, device) for page in doc.get_pages(): # doc.get_pages() 获取page列表 interpreter.process_page(page) layout = device.get_result() for x in layout: if (isinstance(x, LTTextBoxHorizontal)): with open(target, 'a', encoding='utf-8') as f: results = x.get_text() # print(results) f.write(results + '\n')
if name == 'main': path = r'./pdf/aa3061-05.pdf' parse(path,path.replace('.pdf','.txt'))
the warnings: ...... WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5 WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5 ......
The text was updated successfully, but these errors were encountered:
I'm getting tem same problem. I'll let you know if I fix it.
Sorry, something went wrong.
Could you share your solution, please! I have the same problem.
No branches or pull requests
When extracting text from pdf (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warning and the extraction failed.
My code is as:
import os
import sys
import importlib
importlib.reload(sys)
from pdfminer.pdfparser import PDFParser,PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal,LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
def parse(path,target):
if (os.path.exists(target)):
os.remove(target)
fp = open(path, 'rb')
praser = PDFParser(fp)
doc = PDFDocument()
praser.set_document(doc)
doc.set_parser(praser)
if name == 'main':
path = r'./pdf/aa3061-05.pdf'
parse(path,path.replace('.pdf','.txt'))
the warnings:
......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
The text was updated successfully, but these errors were encountered: