I am having trouble extracting tabular data from some complex PDF files with the Python libraries tabula, camelot, and pdfplumber. These libraries generally extract tables well, but for certain PDFs they misidentify the column names: the line immediately preceding the actual header row is used as the column names.
I have tried all three libraries with different parameters and configurations, but I have not been able to resolve the issue for these specific PDFs. My current code is below:
import os

import PyPDF2
import tabula


def extract_tabular_data(filepath):
    output_directory = "files/csv_files"
    os.makedirs(output_directory, exist_ok=True)  # Ensure directory exists
    csv_files = []
    # Extract tables from the PDF file
    tables = tabula.read_pdf(filepath, pages='all', multiple_tables=True)
    # enumerate() yields a running table index, not a page number
    for table_index, table in enumerate(tables):
        output_filename = os.path.join(output_directory, f"table_{table_index + 1}.csv")
        table.to_csv(output_filename, index=False, encoding='utf-8')
        csv_files.append(output_filename)
    return csv_files


def extract_non_tabular_data(filepath):
    non_tabular_data = []
    with open(filepath, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Collect the plain text of every page
        for page in reader.pages:
            non_tabular_data.append(page.extract_text())
    return non_tabular_data
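
For illustration, here is a minimal sketch of a post-processing workaround that I have not verified against the sample PDF. It assumes the table is available as a list of rows, in the shape that pdfplumber's `page.extract_table()` returns (lists of strings, with `None` for empty cells), and that the stray line preceding the real header leaves most of its cells empty. The names `find_header_index`, `promote_header`, and `min_filled_ratio` are hypothetical helpers, not part of any of the libraries.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[str]]


def _is_filled(cell: Optional[str]) -> bool:
    """A cell counts as filled if it holds non-whitespace text."""
    return cell is not None and str(cell).strip() != ""


def find_header_index(rows: Sequence[Row], min_filled_ratio: float = 1.0) -> int:
    """Return the index of the first row whose share of filled cells
    reaches min_filled_ratio (assumption: a real header fills its cells)."""
    for index, row in enumerate(rows):
        if not row:
            continue  # skip completely empty rows
        filled = sum(1 for cell in row if _is_filled(cell))
        if filled / len(row) >= min_filled_ratio:
            return index
    raise ValueError("no row looks like a header")


def promote_header(
    rows: Sequence[Row], min_filled_ratio: float = 1.0
) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Split rows into (header, data), dropping any lines above the header."""
    index = find_header_index(rows, min_filled_ratio)
    header = [str(cell).strip() if _is_filled(cell) else "" for cell in rows[index]]
    data = [list(row) for row in rows[index + 1:]]
    return header, data


# Made-up rows: a title line was extracted above the real header.
example_rows = [
    ["Quarterly report", None, None],
    ["Name", "Region", "Total"],
    ["Alice", "North", "10"],
]
header, data = promote_header(example_rows)
print(header)
print(data)
```

The heuristic is deliberately simple. If real headers in a given PDF contain empty cells, `min_filled_ratio` would need to be lowered, and a title line that happens to fill every cell would still be mistaken for the header.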
To provide context and help with troubleshooting, I have shared a sample PDF document where this issue occurs. You can find it here: https://drive.google.com/file/d/1TCXf8ySefgaxPmOpf19Py2JqQoXN_G0C/view?usp=sharing