pdfplumber table extraction inconsistent formatting of rows

Question

pdfplumber table extraction inconsistent formatting of rows

107 Views Asked by Mark k At 16 January 2024 at 12:34

So i'm trying to scrape the following pdf: https://www.mcw.gov.cy/mcw/dca/dca.nsf/All/6D0FA9590337C56CC22589FD001CD4FB/$file/AIRCRAFT%20REGISTER%20%2030%20JUN%202023%20public.pdf

I've written a python script using pdfplumber that seems to get me most of the way there but what i'm struggling with is around the formatting of the data into neat rows that match the original pdf.

Some columns like the S/N and REG MARKS columns are merged together. Other times the AIRCRAFT OWNER / OPERATOR column is split into two (like on the 5B-DDH and 5B-DDR aircraft)
Some multiline cells end up getting split as well.

The script is show below:

import pandas as pd
import pdfplumber
import numpy as np

pdf_file = 'C:/Users/xxx/Downloads/AIRCRAFT REGISTER  30 JUN 2023 public.pdf'
all_tables = []

with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=(0, 0, 825, 535))

        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }

        table = cropped_page.extract_table(table_settings=table_settings)

        if table:
            for row in table:
                if row[0] or not all_tables:
                    all_tables.append(row)
                else:
                    all_tables[-1] = [f"{a} {b}".strip() for a, b in zip(all_tables[-1], row)]

df_tables = pd.DataFrame(all_tables)

df_tables.replace([None, np.nan], '', inplace=True)

df_tables.reset_index(drop=True, inplace=True)

print(df_tables.head())

csv_file = 'C:/Users/xxx/Downloads/Cyprus_register_cleaned.csv'  
df_tables.to_csv(csv_file, index=False, encoding='utf-8')

Any help, strategy or guidance on how I can clean this data up post-extraction would be appreciated

Original Q&A

There are 1 best solutions below

**Cam** · Accepted Answer · 2024-01-16T16:13:37.773000

So this will get you close. It uses pymupdf which can extract tables.

import pandas as pd
import fitz

pdf_file = r'AIRCRAFT REGISTER  30 JUN 2023 public.pdf'

list_of_list = []
doc = fitz.open(pdf_file)
for page in doc:
    tabs = page.find_tables()
    if tabs.tables:
        list_of_list.append(tabs[0].extract())

columns = {
    0:'S/N',
    1:'REG MARKS',
    2:'MANUFACTURER/AIRCRAFT TYPE',
    3:'AIRCRAFT S/N',
    4:'CLASS ICAO',
    5:'MTOW/Kg',
    6:'REG. DATE',
    7:'C of A / ARC Expiry',
    8:'AIRCRAFT OWNER / OPERATOR',
    9:'ENGINE TYPE',
    10:'No of ENG',
    11:'PROPELLER TYPE',
    12:'AIRCRAFT BASE',
    13:'PAX No.'
    }
        

categories = ['page%s' %i for i in range(20)]
df = pd.concat([pd.DataFrame(d) for d in list_of_list], keys=categories)
df = df.rename(columns=columns)
df.head(30)

You will want to do some clean up maybe delete the first two rows from each page. But you might want to keep the table header which is in row zero of each page.

# Make a new column with table title
df['Title'] = df['S/N'].str.replace(r'[0-9]', '', regex=True).str.replace('S/N', '').str.replace('.', '').replace('', np.nan).ffill(axis = 0)

# Drop top two rows from each table
df = df.drop(([0,1]), level=1, axis=0)

This code will give you a multi level index so you can still keep a track of which page the data came from.

pdfplumber table extraction inconsistent formatting of rows

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in PDFPLUMBER

Trending Questions

Popular # Hahtags

Popular Questions