I'm trying to extract values from a table in a PDF file, but the file's formatting makes it very difficult to extract them in any conventional way.

I've tried Tabula and some other table-extraction methods in Python, but none of them seem to work. I've decided to go with PDFMiner, since it has given me the best results so far and also seems quite customizable.

My current method, which I developed through testing and with the help of ChatGPT, looks like this:

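from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

def extract_text_with_positioning(pdf_path, word_margin=0.1, line_margin=0.5):
    text_positions = []

    # Pass the layout-analysis margins to PDFMiner through LAParams
    laparams = LAParams(word_margin=word_margin, line_margin=line_margin)
    page_layout_list = extract_pages(pdf_path, laparams=laparams)
    for page_layout in page_layout_list:
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    # Skip anything that is not a text line and so may lack a bbox
                    if not isinstance(text_line, LTTextLine):
                        continue
                    # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left
                    bbox = text_line.bbox
                    text_data = {
                        "x": bbox[0],
                        "y": bbox[1],
                        "x1": bbox[2],
                        "y1": bbox[3],
                        "text": text_line.get_text().strip()
                    }
                    text_positions.append(text_data)
    return text_positions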

The idea is to get the text objects and their positions and build a dataframe from them. However, the extract_pages function seems to be merging text pieces that should stay separate. I added the word_margin and line_margin parameters to see if they would fix this, but they had practically no effect.
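To make the "positions to table" step concrete, here is a minimal sketch of how the dictionaries returned above could be grouped into rows before building a dataframe. The helper name group_into_rows and the y_tolerance value are hypothetical, and it assumes that pieces belonging to the same table row share roughly the same bottom y coordinate:

```python
def group_into_rows(text_positions, y_tolerance=2.0):
    """Group text pieces into rows of cell texts, top to bottom, left to right.

    Assumes each piece is a dict with at least "x", "y" and "text" keys,
    and that pieces on the same table row have nearly equal "y" values.
    """
    rows = []  # each entry: (reference_y, [pieces on that row])
    # PDF y coordinates grow upwards, so descending y means top to bottom
    for piece in sorted(text_positions, key=lambda p: -p["y"]):
        if rows and abs(rows[-1][0] - piece["y"]) <= y_tolerance:
            rows[-1][1].append(piece)
        else:
            rows.append((piece["y"], [piece]))
    # Within each row, order the pieces by their left edge
    return [
        [p["text"] for p in sorted(pieces, key=lambda p: p["x"])]
        for _, pieces in rows
    ]
```

The resulting list of lists could then be handed to a dataframe constructor, with the first row used as the header. This only works if every header cell arrives as its own piece, which is exactly where the merging becomes a problem.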

I'll give one example of this merging:

The headers should come back as 8 text pieces, but I only get 6 of them: "Tx MTM", "Duration" and "R$ mm" end up in the same object, separated by a space character.

What can I do to fine-tune PDFMiner so that each of those three header pieces comes back as its own separate object?
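For context, the pdfminer.six documentation describes char_margin (not word_margin) as the LAParams setting that decides whether nearby characters are grouped into the same line, while word_margin only controls where spaces are inserted within a line. So lowering char_margin may be worth trying. An alternative is to split each line manually by the horizontal gaps between its characters. The sketch below shows that idea on plain (text, x0, x1) tuples; the function name split_by_gaps and both thresholds are hypothetical, and it assumes the tuples are built from the LTChar objects inside each text line (using get_text(), x0 and x1), ordered left to right:

```python
def split_by_gaps(chars, max_gap=5.0, word_gap=1.0):
    """Split one line's characters into pieces at large horizontal gaps.

    chars: list of (text, x0, x1) tuples ordered left to right.
    max_gap: a gap wider than this (in PDF points) starts a new piece.
    word_gap: a gap wider than this, but within max_gap, becomes a space.
    Both thresholds are illustrative and need tuning for the actual PDF.
    """
    pieces = []
    current = None  # [text, x0, x1] of the piece being built
    for text, x0, x1 in chars:
        gap = x0 - current[2] if current is not None else None
        if current is not None and gap <= max_gap:
            # Same piece: add a space if the gap looks like a word break
            if gap > word_gap and not current[0].endswith(" "):
                current[0] += " "
            current[0] += text
            current[2] = x1
        else:
            if current is not None:
                pieces.append(current)
            current = [text, x0, x1]
    if current is not None:
        pieces.append(current)
    # Drop pieces that contain only whitespace
    return [
        {"text": t.strip(), "x": a, "x1": b}
        for t, a, b in pieces
        if t.strip()
    ]
```

With a suitable max_gap, a merged line such as "Tx MTM Duration R$ mm" would be expected to come back as three pieces, each with its own x range, as long as the gaps between columns are clearly wider than the gaps between words inside a column.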
