I'm building a python program using Camelot that extracts tables from a PDF (see code below). I am able to successfully execute the code, but I am hitting a road block on how to get a better output result. Specifically, I'm trying to get the code to create a better output for the second table on page 2 (see PDF link).
PDF Link: https://gzhls.at/blob/ldb/d/3/8/d/2e6caff67e5550c07775014eadfd481b39fc.pdf
The extracted table is saved as an excel with the following output result:
I noticed that the extracted result is not quite flowing right when it comes to the alignment of text to match the table in the PDF, which can be confusing when interpreting the data. Another issue is the text "Single Package" and "Single-sided" are combined in row 5 and should (as I picture it) be on its own row.
Any thoughts on how to optimize my code would be appreciated.
Here's my code (please excuse all my notes):
import camelot
import pandas as pd
import xlsxwriter as writer
# Replace 'input.pdf' with the name of your PDF file
# Replace [1, 2, 3, or "all"] with the page numbers to extract tables from
tables = camelot.read_pdf('cSSD_BG3_Series_EOL.pdf', pages='all', layout_kwargs={'detect_vertical': True, 'all_texts': True}, flavor= 'stream', split_text = True, flag_size = True, edge_tol = 900, row_tol = 10, column_tol = 15) #line_scale = 50, shift_text=['']
#Note: Use "stream" to parse tables that have white spaces between cells to simulate a table structure, "line_scale" and "shift_text" cannot be used with 'stream'
#Note: edge_tol,row_tol cannot be used with flavor='lattice'
#Note: use process_background = True when using "lattice"
camelot.plot(tables[0], kind = 'textedge') #see note below
#Note ^ using the "kind" keyword augment, you can can designate the following plot types: 'text', 'grid', 'line', 'joint', 'contour' and 'textedge' - keep in mind that 'line' and 'joint can only be used with 'lattice' and 'textedge can only be used with 'stream'
# Initialize Pandas Excel writer
writer = pd.ExcelWriter('all_tables.xlsx', engine='xlsxwriter')
# Iterate through each table found in the PDF
for i, table in enumerate(tables):
# Extract table data as a Pandas DataFrame, including headers
df = table.df
# Define sheet name based on table number
sheet_name = f'Table {i+1}'
# Write table data to Excel sheet
df.to_excel(writer, sheet_name=sheet_name, index=False)
# Save Excel file
writer.close()
print('All tables exported to all_tables.xlsx')
Have you tried pdfplumber?
It seems to give the best results from all the libraries I've tried.