pdf2image conversion of multi-page PDFs returns the last page for every image


When I use the pdf2image Python package and pass a multi-page PDF to convert_from_bytes() or convert_from_path(), the returned list does contain multiple images, but every image shows the last page of the PDF. I would have expected each image to show a different page.

The output looks something like this:

[Image: pdf2image conversion bug]

Any idea why this occurs? I can't find a solution online. I found a vague suggestion that the use_cropbox argument might help, but changing it has no effect.

# Method of a class; self.pdf2image is assumed to hold the imported pdf2image module
def convert(self, opened_file):
    # Read PDF and convert pages to PPM image objects
    try:
        _ppm_pages = self.pdf2image.convert_from_bytes(
            opened_file.read(),
            grayscale=True
        )
    except Exception as e:
        print(f"[CreateJPEG] Could not convert PDF pages to JPEG image due to error: \n    '{e}'")
        return

    # Do stuff with _ppm_pages
    for img in _ppm_pages:
        img.show()  # ...all images in that list are of the last page

Sometimes the output is an empty 1x1 image instead, and I haven't found a reason for that either. If you have any idea what causes it, please let me know!
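One way to narrow down both symptoms is to inspect the returned list directly instead of relying on img.show(). The following is only a rough sketch, not part of pdf2image: diagnose_pages is a hypothetical helper name, and it assumes each item behaves like a PIL image (has .size, .mode and .tobytes()). It reports which entries are 1x1 and which entries have pixel data identical to an earlier entry.

```python
import hashlib


def diagnose_pages(images):
    """Report suspicious entries in a list of PIL-style images.

    Returns (tiny, duplicates):
      tiny       - indices of images whose size is 1x1
      duplicates - (first_index, later_index) pairs with identical
                   mode, size and pixel data
    """
    tiny = []
    duplicates = []
    seen = {}  # fingerprint -> index of the first image with that fingerprint
    for index, img in enumerate(images):
        if img.size == (1, 1):
            tiny.append(index)
        # Fingerprint the mode, size and raw pixel bytes of the image.
        hasher = hashlib.sha256()
        hasher.update(f"{img.mode}:{img.size}".encode())
        hasher.update(img.tobytes())
        digest = hasher.hexdigest()
        if digest in seen:
            duplicates.append((seen[digest], index))
        else:
            seen[digest] = index
    return tiny, duplicates
```

Calling tiny, duplicates = diagnose_pages(_ppm_pages) right after the conversion would show whether the list really holds identical images, or whether the pages differ and the problem lies in how they are displayed or handled afterwards.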

Thanks in advance, Simon

EDIT: Added code.

EDIT: So, when I try this in a random notebook, it actually works fine.

I've removed a few detours I used in my original code, and now it works. Still not sure what the underlying reason was though...

All the same, thanks for your help, everyone!


There are 2 best solutions below

Answer 1

This is what I'm using right now:

from pdf2image import convert_from_path

# The second positional argument is the DPI
imgSet = convert_from_path(pathToPDF, 500)

That gives me a list of PIL images in imgSet, one per page.
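To check that each entry really is a different page, it can help to write every image to its own file and open the files individually. Below is a minimal sketch along those lines. The names save_pages and convert_and_save, the output file naming, and the default dpi of 200 are my own choices for illustration; only convert_from_path and its dpi argument come from pdf2image.

```python
import os


def save_pages(images, out_dir, prefix="page"):
    """Save each image as <prefix>_<NNN>.jpg (1-based) and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, img in enumerate(images, start=1):
        path = os.path.join(out_dir, f"{prefix}_{number:03d}.jpg")
        img.save(path, "JPEG")  # PIL's Image.save(fp, format)
        paths.append(path)
    return paths


def convert_and_save(pdf_path, out_dir, dpi=200):
    """Convert a PDF with pdf2image and write one JPEG per page."""
    # Imported here so save_pages() can be used without pdf2image installed.
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    return save_pages(images, out_dir)
```

For example, convert_and_save(pathToPDF, "out") should leave one numbered JPEG per page in the out directory, which makes it easy to see whether the pages differ.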

Answer 2

I guess you have to do something like this, as shown in the package's unit tests:

with open("./tests/test.pdf", "rb") as pdf_file:
    images_from_bytes = convert_from_bytes(pdf_file.read(), fmt="jpg")
    self.assertTrue(images_from_bytes[0].format == "JPEG")