I was recently working on a task to modify a Python script that takes a PDF and, based on the requirements, splits it into separate PDFs and writes them to a folder. The code reads through the PDFs and reports the number of pages it processed, but it does not create the documents in the output path.
After I made changes so that it would read through the PDFs better, I no longer get any output files at all.
import re
import time
import os
from PyPDF2 import PdfReader, PdfWriter
import fitz
import parameters
input_dir = parameters.inputs_foldername+'\Certificates'
try:
    os.mkdir(input_dir)
except:
    pass
output_dir = 'Outputs_'+parameters.batch_name+'\Split Certificates'
isExist = os.path.exists(output_dir)
if not isExist:
    os.makedirs(output_dir)
for path in os.listdir(input_dir):
    full_path = os.path.join(input_dir, path)
    t0 = time.time()
    i = 0
    new = True
    pdf_writer = None # Initialize pdf_writer
    parid = None # Initialize parid
    with fitz.open(full_path) as doc:
        pdf = PdfReader(full_path)
        for i in range(len(pdf.pages)):
            text = doc[i].get_text()
            if ("Page 1 of" not in text) and (new == True):
                try:
                    tmp = re.search(r"(?<=\* \* \*\n)\d{7}", text)
                    parid = tmp.group()
                except AttributeError:
                    print(f"Pattern not found in text: {text}")
                    continue
                if pdf_writer is not None:
                    with open(output_dir+'/'+str(parid)+'.pdf', "wb") as out:
                        pdf_writer.write(out)
                pdf_writer = PdfWriter()
                pdf_writer.add_page(pdf.pages[i])
                i += 1
                new = False
            elif ("Page 1 of" not in text) and (new == False):
                pdf_writer.add_page(pdf.pages[i])
            elif ("Page 1 of" in text) and (new == False):
                with open(output_dir+'/'+str(parid)+'.pdf', "wb") as out:
                    pdf_writer.write(out)
                new = True
        # Save the last pdf_writer after the loop
        if pdf_writer is not None:
            with open(output_dir+'/'+str(parid)+'.pdf', "wb") as out:
                pdf_writer.write(out)
    t1 = time.time()
    print(str(i+1)+" pages processed in " + str(int(t1-t0)) + " seconds.")
As I mentioned in my comments, it's hard to help without knowing your PDFs or the details, but I think you might be looking for something like this... maybe. The idea is to have one function that processes the PDF and yields pairs of parid (whatever that is) and the related page; the other function uses itertools.groupby to collate those into groups (assuming the pages are sequential per parid; I think that was the assumption in the original code too) and copies them out to a writer.
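A minimal sketch of that generator-plus-groupby idea could look like the following. It is an illustration under assumptions, not a drop-in fix: the names tag_pages, group_pages and split_pdf are made up here, and it assumes that a page starts a new certificate whenever the question's regex finds a 7-digit ID, with every following page belonging to that ID until the next match. The "Page 1 of" checks from the question are deliberately left out because their intent isn't clear from the post.

```python
import itertools
import os
import re

# Assumption: the ID is the 7-digit number right after a "* * *" line,
# exactly as in the regex from the question.
PARID_RE = re.compile(r"(?<=\* \* \*\n)\d{7}")


def tag_pages(page_texts):
    """Yield (parid, page_index) pairs, carrying the last seen ID forward."""
    parid = None
    for index, text in enumerate(page_texts):
        match = PARID_RE.search(text)
        if match:
            parid = match.group()
        if parid is None:
            continue  # no ID seen yet, so this page cannot be assigned
        yield parid, index


def group_pages(page_texts):
    """Collate consecutive pages with the same parid into (parid, [indices])."""
    tagged = tag_pages(page_texts)
    for parid, pairs in itertools.groupby(tagged, key=lambda pair: pair[0]):
        yield parid, [index for _, index in pairs]


def split_pdf(pdf_path, output_dir):
    """Write one PDF per parid group (hypothetical helper; needs PyPDF2 and PyMuPDF)."""
    # Imported here so the grouping logic above can be tried without the PDF libraries.
    import fitz
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    with fitz.open(pdf_path) as doc:
        texts = [page.get_text() for page in doc]

    os.makedirs(output_dir, exist_ok=True)
    for parid, indices in group_pages(texts):
        writer = PdfWriter()
        for index in indices:
            writer.add_page(reader.pages[index])
        # A repeated, non-adjacent parid would overwrite the earlier file.
        with open(os.path.join(output_dir, f"{parid}.pdf"), "wb") as out:
            writer.write(out)
```

Keeping the grouping separate from the file writing means you can check the split on plain strings first, e.g. print(list(group_pages(texts))), and only then call split_pdf(full_path, output_dir) for each file in the input folder.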