Difficulty in Accurately Extracting Table Column Names Using tabula, camelot, or pdfplumber for Complex PDFs


I am having trouble extracting tabular data from certain complex PDF files with the Python libraries tabula, camelot, and pdfplumber. These libraries generally extract tables well, but for some PDFs they pick the wrong row as the column names: the line preceding the actual column headers is treated as the header row.

I have tried all three libraries with different parameters and configurations, but none of them resolves the issue for these specific PDFs. This is the code I am currently using:

import os

import PyPDF2
import tabula


def extract_tabular_data(filepath):
    output_directory = "files/csv_files"
    os.makedirs(output_directory, exist_ok=True)  # Ensure directory exists
    csv_files = []

    # Extract every table in the PDF; the result is one flat list of DataFrames
    tables = tabula.read_pdf(filepath, pages='all', multiple_tables=True)

    # The index counts tables across the whole document, not pages
    for table_number, table in enumerate(tables, start=1):
        output_filename = os.path.join(output_directory, f"table_{table_number}.csv")
        table.to_csv(output_filename, index=False, encoding='utf-8')
        csv_files.append(output_filename)

    return csv_files


def extract_non_tabular_data(filepath):
    non_tabular_data = []
    with open(filepath, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Collect the raw text of each page, in page order
        for page in reader.pages:
            non_tabular_data.append(page.extract_text())
    return non_tabular_data
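
One workaround I am considering is to stop relying on the library's header guess and pick the header row myself. The following is only a minimal sketch under assumptions: the helper name promote_header, the min_matches threshold, and the sample rows and column names are all hypothetical, and it expects each table as a list of rows (lists of cell strings or None), which is the shape pdfplumber's page.extract_table() returns. It scans for the first row that contains enough of the expected column names and treats everything after it as data.

```python
def promote_header(rows, expected_columns, min_matches=2):
    """Return (header, data_rows) using the first row that looks like the real header.

    A row counts as the header when at least `min_matches` of its cells
    equal one of the expected column names (case-insensitive).
    """
    expected = {name.strip().lower() for name in expected_columns}
    for index, row in enumerate(rows):
        # Empty cells may be None, so normalise them to empty strings
        cells = [(cell or "").strip() for cell in row]
        matches = sum(1 for cell in cells if cell.lower() in expected)
        if matches >= min_matches:
            return cells, [list(r) for r in rows[index + 1:]]
    raise ValueError("no row matched the expected column names")


# Hypothetical rows: a title line sits above the real header row
rows = [
    ["Quarterly report", None, None],
    ["Item", "Quantity", "Price"],
    ["Widget", "4", "9.99"],
]
header, data = promote_header(rows, ["Item", "Quantity", "Price"])
```

This only helps when the expected column names are known in advance, which may not hold for every PDF.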

To help with troubleshooting, I have shared a sample PDF document in which this issue occurs. You can find it here: https://drive.google.com/file/d/1TCXf8ySefgaxPmOpf19Py2JqQoXN_G0C/view?usp=sharing
