I'm struggling with two things in a PDF extraction script I've written.
The first is that the script isn't picking up the last column, 'Serial Number'.
I've boxed the area I'm interested in, along with the explicit vertical strategy lines that I need.
As you can see from the screenshot, the vertical lines neatly divide the columns, and this is true on every page of the PDF. The boxed area also captures everything I want on every page.
My script is here:
import pdfplumber

pdf_file = r"C:\Users\xxxx\Downloads\Active Aircraft Register.pdf"
box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]
all_tables = []

with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=box)
        table = cropped_page.extract_table(table_settings={
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": explicit_vertical_lines,
            "horizontal_strategy": "text",
        })
        if table:
            all_tables.extend(table)

# Check if we have any tables extracted
if not all_tables:
    print("No tables found in the PDF.")
else:
    for row in all_tables[:10]:
        print(row)
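One sanity check that may be relevant to the first issue is comparing the explicit vertical lines against the horizontal extent of the crop box. This is only a sketch: `lines_outside_box` is a hypothetical helper, not part of pdfplumber, and I have not confirmed that a line lying outside the cropped area is what causes the last column to be dropped.

```python
def lines_outside_box(vertical_lines, box):
    """Return the x-positions that lie outside the box's left/right edges.

    `box` is (x0, top, x1, bottom), the same order used for page.crop().
    Hypothetical helper for illustration only.
    """
    x0, _top, x1, _bottom = box
    return [x for x in vertical_lines if x < x0 or x > x1]


# Values copied from the script above.
box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]

outside = lines_outside_box(explicit_vertical_lines, box)
print(outside)  # [984]
```

With the values from the script, the final line at 984 sits to the right of the crop box edge at 980, so the 'Serial Number' column's right boundary is not inside the cropped area.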
The second issue is that once the script completes the extraction, any row with a multi-line cell (e.g. the Address or Type Variant) has each line of that cell placed separately, with a space. Is there any way to make it all go on one line?
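To make the desired result concrete, here is a minimal post-processing sketch. It assumes that the extra lines of a multi-line cell come back as separate rows whose first cell is empty, which is a guess about the output of `"horizontal_strategy": "text"` rather than something confirmed. `merge_continuation_rows` and its `key_col` parameter are hypothetical names, and the sample rows are invented placeholders, not data from the PDF.

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold rows with an empty key column into the row above them.

    Assumption: a row whose `key_col` cell is empty is a continuation
    of the previous row's multi-line cells.
    """
    merged = []
    for row in rows:
        # extract_table() may return None for empty cells; treat as "".
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        if merged and not cells[key_col]:
            previous = merged[-1]
            for i, text in enumerate(cells):
                if text:
                    previous[i] = f"{previous[i]}{sep}{text}" if previous[i] else text
        else:
            merged.append(cells)
    return merged


# Invented placeholder rows, shaped like the assumed extraction output.
sample = [
    ["1", "Owner A", "Address line 1"],
    ["", None, "Address line 2"],
    ["2", "Owner B", "Address line 3"],
]
result = merge_continuation_rows(sample)
for row in result:
    print(row)
```

In the script above this would be applied as `merge_continuation_rows(all_tables)` after the loop. If the first column can legitimately be empty for a real record, a different `key_col` would be needed.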
The PDF file that I'm using can be downloaded here to test the script: https://www.caacayman.com/wp-content/uploads/pdf/Active%20Aircraft%20Register.pdf
