pdfplumber: Merging multi-line cells into one

150 Views Asked by At

So I'm struggling a little bit on one particular aspect of my data extraction script, i've successfully got the script to extract the data but i'm not sure how to deal with merging multi-line cells. For example: I have some data that is spread across two rows (highlighted)

enter image description here

but i want them merged in to one row in to the appropriate column. so instead of:

AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.

it reads AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.

I want this to apply across all columns.

Here is the script that i've written so far:

import pdfplumber
import pandas as pd

pdf_path = "C:/Users/xxxx/Downloads/relev-aronefs-11-12-2023.pdf"
all_rows = []
explicit_vertical_lines = [45, 100, 250, 410, 480, 640, 780]

with pdfplumber.open(pdf_path) as pdf:
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]

        if page_num == 0:
            top_region = 50 
        else:  # For all other pages
            top_region = 0

        cropped_page = page.crop((0, top_region, page.width, page.height))

        table = cropped_page.extract_table({
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": explicit_vertical_lines,
            "horizontal_strategy": "text",
        })

        if table:
            if page_num == 0:
                all_rows += table[1:] 
            else:
                all_rows += table

df = pd.DataFrame(all_rows)

df.columns = ['Immat', 'Constructeur', 'Type d’aéronef', 'SN aéronef', 'Propriétaire', 'Exploitant']

output_csv_path = "C:/Users/xxxx/Downloads/extracted_table_data.csv"
df.to_csv(output_csv_path, index=False)

print(df.head())

If you want to download the pdf i used then its available here:

https://dac.gouvernement.lu/content/dam/gouv_dac/documents/navigabilit%C3%A9/releve-aeronefs/2023/relev-aronefs-11-12-2023.pdf

Any help/guidance would be appreciated.

2

There are 2 best solutions below

0
Timeless On BEST ANSWER

If you can't avoid the empty rows upfront, here is a quick solution :

out = (
    df
        # to get rid of the duplicated headers and footers
        .query("Immat != 'Immat' and ~Exploitant.str.contains('Page')")
        .pipe( # here we ffill the Immat, then groupby & agg/join
            lambda x: x.groupby(
                x["Immat"].replace(
                    {"": None}).ffill(), sort=False)
                .agg(lambda s: " ".join(s).strip()) # << we join the rows here
             )
        # to make a default index
        .reset_index(drop=True)
)

Output (of the highlighted entry) :

Immat Constructeur Type d’aéronef SN aéronef Propriétaire Exploitant
2 LX-AIB THE NEW PIPER AIRCRAFT, INC PA28-161 Cadet 28-41086 AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L. AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.

Ouptut (truncated):

print(out)

      Immat Constructeur Type d’aéronef SN aéronef Propriétaire   Exploitant
0    LX-ABC  BOMBARDI...  BD-700-1...         9213  DMH AVIA...  GLOBAL J...
1    LX-AIA  THE NEW ...        PA18C      18-1011  PIPER CL...  PIPER CL...
2    LX-AIB  THE NEW ...  PA28-161...     28-41086  AÉRO-SPO...  AÉRO-SPO...
3    LX-AIC  REIMS AV...        F172L         0852  AÉRO-SPO...  AÉRO-SPO...
4    LX-AID  REIMS AV...        F172N         1972  AÉRO-SPO...  AÉRO-SPO...
..      ...          ...          ...          ...          ...          ...
267  LX-XMR  STORCH A...  Slepcev ...          136  PROPRIÉT...  EXPLOITA...
268  LX-YCV  BOEING C...    B747-4R7F        35805  CARGOLUX...  CARGOLUX...
269  LX-ZAP  PILATUS ...        PC-24          315  ZAPLEX S.A.  JETFLY A...
270  LX-ZEN  CIRRUS D...         SR22         2972  PROPRIÉT...  EXPLOITA...
271  LX-ZIO  BOMBARDI...  BD-700-1...         9369  ER FRANZ...  GLOBAL J...

[272 rows x 6 columns]
0
حمزة نبيل On

Try to set the horizontal_strategy to explicit and provide the explicit_horizontal_lines as you did with vertical_strategy

Let assume that 30 is the height of each row :

table = cropped_page.extract_table({
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": explicit_vertical_lines,
    "horizontal_strategy": "explicit",
    # start from 1 to skip the headers (each page contain header)
    "explicit_horizontal_lines": [top_region + 30 * i for i in range(1, 18)],
})

if table:
    all_rows += table

Output :

index,Immat,Constructeur,Type d’aéronef,SN aéronef,Propriétaire,Exploitant
0,LX-ABC,BOMBARDIER INC.,BD-700-1A10 GLOBAL EXPRESS XRS,9213,DMH AVIATION LTD.,GLOBAL JET LUXEMBOURG S.A.
1,LX-AIA,"THE NEW PIPER AIRCRAFT, INC",PA18C,18-1011,PIPER CLUB ASBL,PIPER CLUB ASBL
2,LX-AIB,"THE NEW PIPER AIRCRAFT, INC",PA28-161 Cadet,28-41086,"AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.","AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L."
3,LX-AIC,"REIMS AVIATION, CESSNA",F172L,0852,"AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.","AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L."
4,LX-AID,"REIMS AVIATION, CESSNA",F172N,1972,"AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.","AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L."