PDFMiner does not detect all pages


I am trying to extract text from PDFs, but I am running into a problem: my script sometimes detects every page of a PDF and sometimes detects only the first page. I even included this line from a previous Stack Overflow post to check the page count:

print(len(list(extract_pages(pdf_file))))

Whenever my script extracted just the first page, this check also reported only 1 page.

I've also tried another library (PyPDF2) to extract the text, but got even worse results.
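
If it helps, the quickest cross-check I know of is asking PyPDF2 for its page count directly (a rough sketch, not my actual script; PdfReader and reader.pages assume a newer PyPDF2 release, older ones use PdfFileReader(...).getNumPages() instead):

# Cross-check the page count with PyPDF2 (assumes a newer release where PdfReader exists;
# older versions use PdfFileReader(in_file).getNumPages() instead)
from PyPDF2 import PdfReader

reader = PdfReader("/dir/pdfs/example.pdf")  # hypothetical sample file
print(len(reader.pages))                     # page count according to PyPDF2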

If I look at the properties of the PDFs that my script mishandles, Adobe clearly shows the correct number of pages.

Below is the code I am using. Any recommendations on how I might change my script so that it detects all pages of a PDF would be appreciated.

import os
from os.path import isfile, join
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdf_dir = "/dir/pdfs/"
txt_dir = "/dir/txt/"

corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))
for filename in corpus:
    print(filename)
    output_string = StringIO()
    with open(join(pdf_dir, filename), 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
        txt_name = "{}.txt".format(filename[:-4])
        with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
            o.write(output_string.getvalue())
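
For what it's worth, one diagnostic that might narrow this down (a sketch, not part of my script above) is to compare the page count the document catalog itself declares with the number of pages PDFPage.create_pages() actually yields, to see whether pdfminer is losing pages at parse time; resolve1 comes from pdfminer.pdftypes, and the file path is hypothetical:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

with open("/dir/pdfs/example.pdf", 'rb') as in_file:  # hypothetical sample file
    doc = PDFDocument(PDFParser(in_file))
    declared = resolve1(doc.catalog['Pages'])['Count']  # page count the PDF itself declares
    parsed = len(list(PDFPage.create_pages(doc)))       # pages pdfminer actually yields
    print(declared, parsed)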

1 Answer


Here is a solution. After trying different libraries in R (pdftools) and Python (pdfplumber), PyMuPDF (imported as fitz) worked best for me.

from io import StringIO
import os
from os.path import isfile, join
import fitz

pdf_dir = "pdf path"
txt_dir = "txt path"

corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))
for filename in corpus:
    print(filename)
    output_string = StringIO()
    doc = fitz.open(join(pdf_dir, filename))
    for page in doc:
        # "rawdict" returns a dict, not a str, so convert it before writing it to the buffer
        output_string.write(str(page.getText("rawdict")))
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
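
If you only need plain text in the .txt files, a simpler variant (a sketch on top of the code above, not what I originally ran) is to request the "text" layout, which returns a string directly; note that current PyMuPDF releases spell the method page.get_text(), with getText kept as a legacy alias:

import fitz  # PyMuPDF

doc = fitz.open("/dir/pdfs/example.pdf")  # hypothetical sample file
plain_text = "".join(page.get_text("text") for page in doc)  # plain text for the whole document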