So I'm struggling a little bit on one particular aspect of my data extraction script, i've successfully got the script to extract the data but i'm not sure how to deal with merging multi-line cells. For example: I have some data that is spread across two rows (highlighted)
but i want them merged in to one row in to the appropriate column. so instead of:
AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.
it reads AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.
I want this to apply across all columns.
Here is the script that i've written so far:
import pdfplumber
import pandas as pd
pdf_path = "C:/Users/xxxx/Downloads/relev-aronefs-11-12-2023.pdf"
all_rows = []
explicit_vertical_lines = [45, 100, 250, 410, 480, 640, 780]
with pdfplumber.open(pdf_path) as pdf:
for page_num in range(len(pdf.pages)):
page = pdf.pages[page_num]
if page_num == 0:
top_region = 50
else: # For all other pages
top_region = 0
cropped_page = page.crop((0, top_region, page.width, page.height))
table = cropped_page.extract_table({
"vertical_strategy": "explicit",
"explicit_vertical_lines": explicit_vertical_lines,
"horizontal_strategy": "text",
})
if table:
if page_num == 0:
all_rows += table[1:]
else:
all_rows += table
df = pd.DataFrame(all_rows)
df.columns = ['Immat', 'Constructeur', 'Type d’aéronef', 'SN aéronef', 'Propriétaire', 'Exploitant']
output_csv_path = "C:/Users/xxxx/Downloads/extracted_table_data.csv"
df.to_csv(output_csv_path, index=False)
print(df.head())
If you want to download the pdf i used then its available here:
Any help/guidance would be appreciated.

If you can't avoid the empty rows upfront, here is a quick solution :
Output (of the highlighted entry) :
Ouptut (truncated):