Cannot identify image file for BytesIO object in pdf2image - convert_from_path


I am retrieving pages from a PDF using convert_from_path (pdf2image). This is the error I am facing:

<ipython-input-45-4ebf020b9136> in <cell line: 1>()
      1 for pdf in list_of_pdfs:
----> 2   images = convert_from_path(pdf,first_page= 1,last_page=2)

2 frames
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    266                 )
    267             else:
--> 268                 images += parse_buffer_func(data)
    269     finally:
    270         if auto_temp_dir:

/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py in parse_buffer_to_ppm(data)
     26         size_x, size_y = tuple(size.split(b" "))
     27         file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28         images.append(Image.open(BytesIO(data[index : index + file_size])))
     29         index += file_size
     30 

/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
   3281                 raise
   3282         return None
-> 3283 
   3284     im = _open_core(fp, filename, prefix, formats)
   3285 

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7820221cd290>

Here is the code I am using:

import io
from io import BytesIO
from PIL import Image
from pdf2image import convert_from_path

pdf_list = ['path_to_pdf.pdf', 'path_to_pdf2.pdf']
for pdf in pdf_list:
    images = convert_from_path(pdf, first_page=1, last_page=2)

This code was working perfectly fine a few days ago. I am not sure what changed, and I can't figure out why it fails now.


There are 2 best solutions below

Mahboob Nur On

You can figure it out through exception handling, like this:

from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)

pdf_list = ['path_to_pdf.pdf', 'path_to_pdf2.pdf']

for pdf in pdf_list:
    try:
        images = convert_from_path(pdf, first_page=1, last_page=2)
  
    except PDFInfoNotInstalledError as e:
        print(f"PDFInfoNotInstalledError: {e}")
    except PDFPageCountError as e:
        print(f"PDFPageCountError: {e}")
    except PDFSyntaxError as e:
        print(f"PDFSyntaxError: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")
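
Note that the UnidentifiedImageError in the traceback is raised by Pillow, not pdf2image, so it would fall through to the generic except branch above. Below is a minimal sketch of a variation that keeps going past a failing file and records the exception type for each one. convert_each is a hypothetical helper name, and the converter is passed in as an argument so the helper itself does not depend on pdf2image:

```python
def convert_each(pdf_paths, convert, **kwargs):
    """Call convert(path, **kwargs) for every path.

    Returns two dicts: successful results keyed by path, and
    "ExceptionType: message" strings keyed by path for the failures.
    """
    results, failures = {}, {}
    for path in pdf_paths:
        try:
            results[path] = convert(path, **kwargs)
        except Exception as exc:  # deliberately broad: this is a diagnostic helper
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

With pdf2image installed, this could be called as convert_each(pdf_list, convert_from_path, first_page=1, last_page=2), after which failures shows which files break and with which exception.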
Harshal Naik On

I downgraded Pillow to 10.0.1. It's working now.
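
To check whether a Pillow upgrade coincides with the breakage, it can help to print the installed versions first. Here is a small sketch using only the standard library. installed_version is a hypothetical helper name, and the pin in the comment simply mirrors the version reported in this answer:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the two libraries that appear in the traceback.
for name in ("Pillow", "pdf2image"):
    print(name, installed_version(name))

# To pin the version this answer reports as working (run in a shell or notebook cell):
#   pip install "Pillow==10.0.1"
```

In a hosted notebook the runtime usually has to be restarted after changing the Pillow version so that the new one is actually imported.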