How to parse a PDF document into an array of images directly in a RAM buffer


I'm trying to convert a huge PDF document into a list of images (one BMP image per page). I use Ghostscript and Python to turn the PDF into a list of NumPy arrays, but my current approach is very inefficient:

import os
import shutil
from pathlib import Path

import ghostscript
import matplotlib.pyplot as plt

# WIDTH and HEIGHT (output page size in pixels) are defined elsewhere in my script.

def get_imgs_gs(path_to_pdf):
    cpu_number = os.cpu_count()  # number of cores (only used by the commented-out option)

    folder_name = "bmp_imgs"  # name of the temporary folder for the images
    Path(folder_name).mkdir(parents=True, exist_ok=True)  # create the folder
    abs_path = os.path.abspath(folder_name)  # absolute path to the folder

    args = [
        'gs',
        '-sDEVICE=bmpgray',
        f'-g{WIDTH}x{HEIGHT}',
        # f'-dNumRenderingThreads={cpu_number}',
        '-r247x247',
        '-dNOPAUSE',
        '-dBATCH',
        # no extra quotes: the argument is passed as-is, without a shell
        f'-sOutputFile={abs_path}/%04d.bmp',
        path_to_pdf
    ]
    ghostscript.Ghostscript(*args)  # run ghostscript

    # zero-padded file names, so a plain sort gives the page order
    content = sorted(os.listdir(abs_path))

    # read every page image back from disk
    imgs = [plt.imread(os.path.join(abs_path, name)) for name in content]
    shutil.rmtree(abs_path)  # remove the temporary images

    return imgs

As you can see from the code above, I save the images to disk, read them back, and then delete them.

How can I avoid this step? I tried to use the ANSI C API of gs but didn't find a solution. The only option I found is to read the image bitmaps from stdout.

Can somebody help me? I would also like to improve the speed. I tried -dNumRenderingThreads={cpu_number}, but it didn't help.
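
For reference, here is a minimal sketch of the stdout idea, not a tested solution. It makes several assumptions that are not in my code above: it calls the gs executable through subprocess instead of the ghostscript Python module, it swaps the bmpgray device for pgmraw (binary PGM has a much simpler header than BMP, so the pages are easy to split), and it relies on -sOutputFile=- together with -q so that only image data is written to stdout. The names split_pgm_stream and render_pdf_to_pgm_pages are made up for illustration.

```python
import subprocess


def _read_token(buf, pos):
    """Return (token, new_pos), skipping whitespace and '#' comment lines."""
    n = len(buf)
    while pos < n:
        c = buf[pos:pos + 1]
        if c == b"#":
            # a comment runs to the end of the line
            nl = buf.find(b"\n", pos)
            pos = n if nl == -1 else nl + 1
        elif c.isspace():
            pos += 1
        else:
            break
    start = pos
    while pos < n and not buf[pos:pos + 1].isspace():
        pos += 1
    return buf[start:pos], pos


def split_pgm_stream(buf):
    """Split concatenated binary PGM (P5) images into (height, width, pixels)."""
    pages = []
    pos = 0
    while True:
        magic, pos = _read_token(buf, pos)
        if not magic:
            break  # end of stream
        if magic != b"P5":
            raise ValueError(f"unexpected magic number: {magic!r}")
        width_tok, pos = _read_token(buf, pos)
        height_tok, pos = _read_token(buf, pos)
        maxval_tok, pos = _read_token(buf, pos)
        width, height = int(width_tok), int(height_tok)
        if int(maxval_tok) > 255:
            raise ValueError("only 8-bit PGM is handled in this sketch")
        pos += 1  # exactly one whitespace byte separates header and pixels
        size = width * height
        pixels = buf[pos:pos + size]
        if len(pixels) != size:
            raise ValueError("truncated image data")
        pages.append((height, width, pixels))
        pos += size
    return pages


def render_pdf_to_pgm_pages(path_to_pdf, dpi=247, gs_executable="gs"):
    """Render all pages to stdout and split them in memory (assumes gs is on PATH)."""
    args = [
        gs_executable,
        "-q",                # keep messages out of stdout
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=pgmraw",   # assumption: grayscale PGM instead of bmpgray
        f"-r{dpi}",
        "-sOutputFile=-",    # write the pages to stdout
        path_to_pdf,
    ]
    result = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    return split_pgm_stream(result.stdout)
```

Each page could then be turned into an array without copying, for example with np.frombuffer(pixels, dtype=np.uint8).reshape(height, width). Note that the whole output is held in memory at once, which may matter for a very large PDF.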
