pdfplumber not picking up column & issue with multiline data

51 Views Asked by Mark k At 06 March 2024 at 15:10

I'm struggling with two things in a PDF extraction script I've written.

The first is that the script isn't picking up the last column, 'Serial Number'. I've boxed the area I'm interested in, along with the explicit vertical lines that I need.

As you can see from the screenshot, the vertical lines neatly divide the columns, and this is true on every page of the PDF. The boxed area also captures everything I want on every page.

My script is here:

import pdfplumber

pdf_file = r"C:\Users\xxxx\Downloads\Active Aircraft Register.pdf"
box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]

all_tables = []


with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=box)
        table = cropped_page.extract_table(table_settings={
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": explicit_vertical_lines,
            "horizontal_strategy": "text",
        })
        if table:
            all_tables.extend(table)

# Check if we have any tables extracted
if not all_tables:
    print("No tables found in the PDF.")
else:
    for row in all_tables[:10]:
        print(row)
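
One thing that may be worth checking for the missing column (a guess, not verified against this PDF): the crop box above ends at x = 980, while the last explicit vertical line sits at x = 984. With "horizontal_strategy": "text", the row edges are derived from the text itself, so a vertical line lying beyond the cropped area may not meet any row edge. The sketch below only does the arithmetic; `lines_outside_box` is a hypothetical helper, not part of pdfplumber.

```python
def lines_outside_box(box, vertical_lines):
    """Return the explicit vertical lines whose x position falls outside the crop box.

    box is (x0, top, x1, bottom), the same tuple layout passed to page.crop().
    Hypothetical helper for illustration; it does not call pdfplumber.
    """
    x0, _top, x1, _bottom = box
    return [x for x in vertical_lines if x < x0 or x > x1]


box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]

# Any value listed here is a column boundary that the crop box does not cover.
print(lines_outside_box(box, explicit_vertical_lines))
```

With the values from the script this reports 984, the right-hand edge of the 'Serial Number' column. If that turns out to matter, widening the box or moving the last line inside it would be the obvious things to try.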

The second issue: once the extraction completes, any row with a multi-line cell (e.g. Address or Type Variant) has each line of that cell placed separately, with a space. Is there any way to make it all go on one line?
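
One possible post-processing approach is sketched below. It assumes the extra lines of a multi-line cell come back as separate rows whose first column is empty, and that every row has the same number of cells; neither assumption has been checked against this PDF. The helper name `merge_continuation_rows` and the sample rows are made up for illustration.

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold continuation rows into the row above them.

    A row counts as a continuation when its key column is empty (assumption).
    None cells are treated as empty strings; fully blank rows are dropped.
    """
    merged = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # nothing on this row at all
        if merged and not cells[key_col]:
            previous = merged[-1]
            for i, text in enumerate(cells):
                if text:
                    # Append to the cell above, or fill it if it was empty.
                    previous[i] = f"{previous[i]}{sep}{text}" if previous[i] else text
        else:
            merged.append(cells)
    return merged


# Hypothetical rows shaped like extract_table() output (a list of lists).
sample_rows = [
    ["VP-AAA", "Model X", "1 Example Street"],
    ["", "-200", "Example Town"],
    ["VP-BBB", "Model Y", None],
]

for row in merge_continuation_rows(sample_rows):
    print(row)
```

In the script above this would be applied to `all_tables` after the loop, for example `all_tables = merge_continuation_rows(all_tables)`. If the first column can itself wrap onto a second line, a different key column or rule would be needed.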

The PDF file that I'm using can be downloaded here to test the script: https://www.caacayman.com/wp-content/uploads/pdf/Active%20Aircraft%20Register.pdf
