So i'm trying to scrape the following pdf: https://www.mcw.gov.cy/mcw/dca/dca.nsf/All/6D0FA9590337C56CC22589FD001CD4FB/$file/AIRCRAFT%20REGISTER%20%2030%20JUN%202023%20public.pdf
I've written a python script using pdfplumber that seems to get me most of the way there but what i'm struggling with is around the formatting of the data into neat rows that match the original pdf.
- Some columns like the S/N and REG MARKS columns are merged together. Other times the AIRCRAFT OWNER / OPERATOR column is split into two (like on the 5B-DDH and 5B-DDR aircraft)
- Some multiline cells end up getting split as well.
The script is show below:
import pandas as pd
import pdfplumber
import numpy as np
pdf_file = 'C:/Users/xxx/Downloads/AIRCRAFT REGISTER 30 JUN 2023 public.pdf'
all_tables = []
with pdfplumber.open(pdf_file) as pdf:
for page in pdf.pages:
cropped_page = page.crop(bbox=(0, 0, 825, 535))
table_settings = {
"vertical_strategy": "text",
"horizontal_strategy": "text",
}
table = cropped_page.extract_table(table_settings=table_settings)
if table:
for row in table:
if row[0] or not all_tables:
all_tables.append(row)
else:
all_tables[-1] = [f"{a} {b}".strip() for a, b in zip(all_tables[-1], row)]
df_tables = pd.DataFrame(all_tables)
df_tables.replace([None, np.nan], '', inplace=True)
df_tables.reset_index(drop=True, inplace=True)
print(df_tables.head())
csv_file = 'C:/Users/xxx/Downloads/Cyprus_register_cleaned.csv'
df_tables.to_csv(csv_file, index=False, encoding='utf-8')
Any help, strategy or guidance on how I can clean this data up post-extraction would be appreciated
So this will get you close. It uses
pymupdfwhich can extract tables.You will want to do some clean up maybe delete the first two rows from each page. But you might want to keep the table header which is in row zero of each page.
This code will give you a multi level index so you can still keep a track of which page the data came from.