WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont= #12

Open
jackyetz opened this issue Mar 3, 2019 · 3 comments


jackyetz commented Mar 3, 2019

When extracting text from a PDF (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I get a lot of warnings and the extraction fails.

My code is:

import os

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams


def parse(path, target):
    # Start from a clean output file.
    if os.path.exists(target):
        os.remove(target)

    with open(path, 'rb') as fp:
        # Link the parser and the document to each other, then initialize.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()

        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        laparams = LAParams(all_texts=True)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        with open(target, 'a', encoding='utf-8') as f:
            for page in doc.get_pages():  # doc.get_pages() returns the list of pages
                interpreter.process_page(page)
                layout = device.get_result()
                # Write out only the horizontal text boxes.
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        f.write(x.get_text() + '\n')


if __name__ == '__main__':
    path = r'./pdf/aa3061-05.pdf'
    parse(path, path.replace('.pdf', '.txt'))

The warnings:
......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
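
For anyone triaging output like this, here is a minimal sketch that uses only the standard library. The logger name `pdfminer.converter` is taken from the prefix of the warning lines above. The helper `count_undefined_by_font` is hypothetical, not part of pdfminer. Reading the trailing number as the ID of a character the font could not map to text is an assumption based on the message format. Raising the log level only hides the messages. It does not repair the extraction.

```python
import logging
import re
from collections import Counter

# Hide the per-character warnings from the logger named in the output above.
# This only silences the messages; unmapped characters remain unmapped.
converter_log = logging.getLogger("pdfminer.converter")
converter_log.setLevel(logging.ERROR)


def count_undefined_by_font(log_lines):
    """Count 'undefined' warnings per basefont in captured log lines.

    Hypothetical helper: it only parses text shaped like the warnings
    quoted in this issue and ignores every other line.
    """
    pattern = re.compile(r"undefined: <PDFType1Font: basefont='([^']+)'>")
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5",
    "WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4",
    "WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5",
]
summary = count_undefined_by_font(sample)
```

A per-font count makes it easier to see which embedded fonts trigger the warnings. In the excerpt above, those are the subsetted `txsy` and `txex` fonts.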


paulfwb commented May 3, 2020

I'm getting the same problem.
I'll let you know if I fix it.

rocket2016 commented

Could you share your solution, please! I have the same problem.

