How to write an extracted image to a file object instead of to the file system?

I'm using the Python pdfminer library to extract both text and images from a PDF. Since the TextConverter class writes to sys.stdout by default, I used StringIO to capture the text in a variable, as follows:

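# Python 2 (matches the file() call used inside pdfminer's ImageWriter)
from StringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extractTextAndImagesFromPDF(rawFile):
    laparams = LAParams()
    # ImageWriter saves every image it encounters into this folder.
    imagewriter = ImageWriter('extractedImageFolder/')
    resourceManager = PDFResourceManager(caching=True)

    outfp = StringIO()  # Use StringIO to catch the output later.
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(rawFile, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    extractedText = outfp.getvalue()  # Get the text from the StringIO
    outfp.close()
    return extractedText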

This works fine for the extracted text. The function also extracts the images in the PDF and writes them to 'extractedImageFolder/'. This works fine too, but I now want the images to be written to a file object instead of to the file system, so that I can do some post-processing on them.

The ImageWriter class opens a file (fp = file(path, 'wb')) and then writes to it. I would like my extractTextAndImagesFromPDF() function to also return a list of file objects instead of writing the images directly to disk. I guess I need StringIO for that as well, but I don't know how, partly because the writing happens inside pdfminer.

Does anybody know how I can return a list of file objects instead of writing the images to the file system? All tips are welcome!

There is 1 solution below

Here is a hack: patch the export_image method of pdfminer's ImageWriter so that you can supply your own file pointer to write to:

    # add an optional argument so the caller can supply a file pointer
    def export_image(self, image, fp=None):
        ...
        # change this line:
        # fp = file(path, 'wb')
        # to:
        caller_fp = fp is not None  # remember whether the caller passed fp
        if not caller_fp:
            fp = file(path, 'wb')
        ...
        # and this line:
        # return name
        # to:
        return (fp, name) if caller_fp else name

Now you would need to use:

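# create a file-like object backed by an in-memory string buffer (Python 2)
import StringIO
fp = StringIO.StringIO()
# imagewriter is your ImageWriter instance, image is the image being exported
image_fp, name = imagewriter.export_image(image, fp)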

and your image should be stored in fp.
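
To post-process what ended up in the buffer, it helps to remember that the write position sits at the end of the data after writing. The following is a minimal sketch of reading it back, assuming Python 3 and io.BytesIO (the binary counterpart of the Python 2 StringIO used above); fake_image_bytes is made-up placeholder data, not a real image:

```python
import io

# Placeholder for the bytes a patched export_image would write
# (hypothetical data, not a real image).
fake_image_bytes = b"\xff\xd8\xff\xe0 fake jpeg payload"

fp = io.BytesIO()
fp.write(fake_image_bytes)  # the position is now at the end of the buffer

# Option 1: fetch the whole buffer regardless of the current position.
data = fp.getvalue()

# Option 2: rewind so that downstream code can read() it like a file.
fp.seek(0)
reread = fp.read()
```

Either way the buffer has to stay open: once close() is called on an in-memory buffer, its contents are discarded.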

Note that the behaviour of export_image remains the same wherever it is called without the fp argument.
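
The optional-fp idea can be tried out without pdfminer. The sketch below is only an illustration under stated assumptions: SketchImageWriter and export_all_to_memory are hypothetical stand-ins (not pdfminer's API), the image payloads are made up, and it targets Python 3, so it uses open and io.BytesIO rather than file and StringIO. It returns a list of in-memory file objects when a buffer is passed, and still writes to disk when none is given:

```python
import io
import os
import tempfile


class SketchImageWriter:
    """Hypothetical stand-in for pdfminer's ImageWriter (not the real class)."""

    def __init__(self, outdir):
        self.outdir = outdir

    def export_image(self, name, data, fp=None):
        # Remember whether the caller supplied the file object.
        caller_supplied_fp = fp is not None
        if not caller_supplied_fp:
            fp = open(os.path.join(self.outdir, name), "wb")
        fp.write(data)
        if caller_supplied_fp:
            fp.seek(0)       # rewind so the caller can read it right away
            return fp, name  # leave the caller's buffer open
        fp.close()           # only close files this method opened itself
        return name


def export_all_to_memory(writer, images):
    """Return one in-memory file object per (name, data) pair."""
    buffers = []
    for name, data in images:
        fp, _ = writer.export_image(name, data, io.BytesIO())
        buffers.append(fp)
    return buffers


outdir = tempfile.mkdtemp()
writer = SketchImageWriter(outdir)
images = [("a.jpg", b"first"), ("b.jpg", b"second")]  # made-up payloads
buffers = export_all_to_memory(writer, images)

# Without fp, the method behaves as before and writes to disk.
disk_name = writer.export_image("c.jpg", b"third")
```

The sketch closes only the files it opened itself. If a patched method closes a caller-supplied buffer after writing, the buffer's contents can no longer be read, so that is worth checking when applying the hack.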