pdf2image: how to remove the '0001' in jpg file names?


My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being stored in memory.

In Python 3.11:

from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"

images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           poppler_path=poppler_path, paths_only=True)

pdf2image generates files with the following names: 'test_0001-1.jpg', 'test_0001-2.jpg', etc.

Problem: I would like the files to have names without the '_0001-' part (e.g. 'test1.jpg').

The only way so far seems to be to use convert_from_path WITHOUT output_folder and then save each image with image.save(). But that way the images are first stored in memory, which can easily add up to many megabytes.

Is it possible to change the way pdf2image generates the file names when saving images directly to files?


There are 3 best solutions below

1
On BEST ANSWER

Just use the Poppler utilities directly (or xpdf's pdftopng) by calling them from a shell. Add other options such as -r 200 as desired for resolutions other than 150.

I recommend PNG for better image fidelity. If you want .jpg, replace "-png" below with "-jpeg"; the direct answer to the question as asked would be pdftoppm -jpeg -f 1 -l 9 -sep "" test.pdf "test". Do follow the enhancement below for file sorting, though: Windows file sorting needs leading zeros, otherwise the order in a zip or folder is 1, 10, 11, ... 2, 20, ..., which is often undesirable.

"path to bin\pdftoppm" -png "path to \in.pdf" "name"

Result =

  • name-1.png
  • name-2.png etc.
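From Python, the same call can be made with subprocess instead of a shell. This is only a sketch: the helper names build_pdftoppm_command and run_pdftoppm are made up for illustration, and it assumes pdftoppm is on PATH (otherwise pass the full path as executable). It uses only the options mentioned in this answer.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, image_format="png",
                           dpi=None, first_page=None, last_page=None,
                           separator=None, executable="pdftoppm"):
    """Return the argument list for one pdftoppm call (hypothetical helper)."""
    if image_format not in ("png", "jpeg"):
        raise ValueError("this sketch only handles 'png' and 'jpeg'")
    command = [executable, f"-{image_format}"]
    if dpi is not None:
        command += ["-r", str(dpi)]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    if separator is not None:
        command += ["-sep", separator]
    return command + [str(pdf_path), str(output_prefix)]


def run_pdftoppm(*args, **kwargs):
    """Run pdftoppm; the images are written to disk by pdftoppm itself."""
    subprocess.run(build_pdftoppm_command(*args, **kwargs), check=True)


# Example (paths are placeholders):
# run_pdftoppm("in.pdf", "name", image_format="png", dpi=200)
```

Because pdftoppm writes the files itself, no image data passes through the Python process.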

Adding leading digits is limited compared to other apps, so if you want "name-01.png" you first need to output only pages 1-9, with "0" as the separator:

\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"

Then, for pages 10 onward (say, for a file of up to 99 pages), use the default separator; only the page numbers that actually exist are used:

\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"

Thus, for 12 pages, this second call produces only -10, -11 and -12, as required.
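The sorting issue behind all this is easy to reproduce with a plain string sort. A small illustration in Python, using made-up file names for a 12-page document:

```python
# File names for a 12-page document, without and with a leading zero.
unpadded = [f"name-{page}.png" for page in range(1, 13)]
padded = [f"name-{page:02d}.png" for page in range(1, 13)]

# A plain string sort compares character by character.
unpadded_sorted = sorted(unpadded)
padded_sorted = sorted(padded)

print(unpadded_sorted[:4])  # ['name-1.png', 'name-10.png', 'name-11.png', 'name-12.png']
print(padded_sorted[:4])    # ['name-01.png', 'name-02.png', 'name-03.png', 'name-04.png']
```

With the leading zero, the string order and the page order agree.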

For up to 9999 pages the same approach needs 4 calls. If you don't want the "-", simply delete it. For a different output directory, adjust the output name accordingly.

set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"

"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"

In the 12-page example above, the worst case is that the last two calls reply with
Wrong page range given: the first page (100) can not be after the last page (12). and the same for 1000. Those warnings can be ignored.

Those 4 lines could go in a Windows batch file (or the equivalent script on another OS) that accepts arguments, for use via SendTo or drag and drop. Then simply call it from the system or from Python, e.g. pdf2png.bat input.pdf, for each file; in that simple case the output lands in the same directory as the input.
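If the calls are issued from Python anyway, the page ranges can be computed rather than hard-coded, which also avoids the "Wrong page range" warnings. A rough sketch follows; padded_calls and padded_commands are hypothetical helpers, the page count has to be supplied by the caller, and like the batch script it assumes pdftoppm appends the page number without padding of its own. Here all the zeros go into the output prefix and the separator is left empty.

```python
def padded_calls(page_count, width=4):
    """Yield (first_page, last_page, zeros) for each pdftoppm call needed.

    `zeros` is the run of "0" characters to place before the page number
    so that every number ends up `width` digits wide.
    """
    if page_count >= 10 ** width:
        raise ValueError("page_count does not fit into the requested width")
    for digits in range(1, width + 1):
        first = 10 ** (digits - 1)
        if first > page_count:
            break  # the document has no pages in this range
        last = min(10 ** digits - 1, page_count)
        yield first, last, "0" * (width - digits)


def padded_commands(pdf_path, base_name, page_count, width=4,
                    executable="pdftoppm"):
    """Build one argument list per range, ready for subprocess.run()."""
    return [
        [executable, "-png", "-f", str(first), "-l", str(last),
         "-sep", "", pdf_path, f"{base_name}-{zeros}"]
        for first, last, zeros in padded_calls(page_count, width)
    ]
```

For a 12-page file this yields two calls, one for pages 1-9 and one for pages 10-12.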

1
On

I don't know whether Poppler already has parameters to customize the generated file names, but you can always do this:

  1. Run the command in an empty directory (e.g. in tempfile.TemporaryDirectory())
  2. After command finishes, list the contents of the directory and store the result in a list
  3. Iterate over the list with a regex that will match the numbers, and create a dict for the mapping (integer to file name)

At this point you are free to rename the files to whatever you like, or to process them.

The benefit of this solution is that it's neutral, robust and works for many similar scenarios.
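The steps above might look roughly like this in Python. It is a sketch under assumptions: the names map_pages and rename_pages are invented, the page number is taken to be the last run of digits in the file name (as in 'test_0001-2.jpg'), and the directory is assumed to contain only the generated images.

```python
import re
from pathlib import Path

# Matches a trailing page number such as the "2" in "test_0001-2.jpg".
PAGE_NUMBER = re.compile(r"(\d+)$")


def map_pages(folder):
    """Return {page number: path} for every file whose stem ends in digits."""
    pages = {}
    for path in Path(folder).iterdir():
        match = PAGE_NUMBER.search(path.stem)
        if path.is_file() and match:
            pages[int(match.group(1))] = path
    return pages


def rename_pages(folder, stem):
    """Rename the mapped files to '<stem><page><ext>', e.g. 'test1.jpg'."""
    renamed = []
    for page, path in sorted(map_pages(folder).items()):
        target = path.with_name(f"{stem}{page}{path.suffix}")
        path.rename(target)
        renamed.append(target)
    return renamed
```

After the conversion has finished, rename_pages(output_folder, "test") would then turn 'test_0001-1.jpg' into 'test1.jpg', and so on.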

3
On

Hi, have a look at the file generators.py in the pdf2image codebase.

The function to look at is counter_generator(prefix="", suffix="", padding_goal=4).

In my copy it is at line 41:


....
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)

....
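To see what the padding does in isolation, the function can be copied out and run on its own. The sketch below reproduces it without the @threadsafe decorator and adds a variant with no counter at all; plain_generator is an invented name and not part of pdf2image. Whether handing such a generator to convert_from_path as output_file gives the desired file names depends on the pdf2image version and is an untested assumption here.

```python
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Copy of the quoted function, minus the @threadsafe decorator."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def plain_generator(prefix="", suffix=""):
    """Hypothetical variant: yields prefix + suffix with no counter."""
    while True:
        yield str(prefix) + str(suffix)


default_name = next(counter_generator("test_"))                   # 'test_0001'
unpadded_name = next(counter_generator("test", padding_goal=0))   # 'test1'
plain_name = next(plain_generator("test"))                        # 'test'
```

Note that padding_goal=0 only removes the zeros; the counter itself stays in the name.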

I think you need to play with the zfill() call in the yield line:

The Python string zfill() method fills the string with zeros on its left until it reaches a certain width; this is also called padding. If the string starts with a sign character (+ or -), the zeros are added after the sign character rather than before.

The zfill() method does not pad the string if the string is already at least as long as the requested width.

Note: zfill() works similarly to rjust() when '0' is passed as the fillchar parameter of rjust().

https://www.tutorialspoint.com/python/string_zfill.htm
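A few quick checks of the zfill() behaviour described above, using arbitrary example strings:

```python
padded = "1".zfill(4)           # zeros are added on the left
signed = "-1".zfill(4)          # zeros go after the sign character
too_long = "12345".zfill(4)     # already wider than 4, so unchanged
via_rjust = "1".rjust(4, "0")   # same result as zfill() for unsigned text

print(padded, signed, too_long, via_rjust)  # 0001 -001 12345 0001
```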