How to extract only certain tables, in a structured format, from a PDF (invoice) that contains multiple tables

421 Views Asked by Jyoti yadav At 25 June 2025 at 09:38

How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but the problem is that it gives me all the tables in the PDF as CSV. I need to extract only certain tables based on conditions such as the text they contain or their bounding box dimensions.

Apart from the paid tool, these are the other libraries I have tried:

PyPDF2
Textract
Tika
pdfplumber
pdfminer
pdftotext
PyMuPDF – bounding box technique
Tabula

The problem arises when I have multiple PDFs: some open-source libraries can read the text of a PDF but do not return it in a structured format. Sometimes they cannot read the text at all because the PDF is a scanned, image-only document.

So I decided to use Amazon Textract. Let me know if you can recommend any other libraries or paid tools that work better than Amazon Textract.

There is 1 best solution below

Thomas On 03 March 2023 at 00:34

The .csv files that you get from Amazon Textract are a post-processed version of the raw API output. You can use the API output to select what you need based on some criteria that you define.
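To illustrate what "selecting from the API output" can look like, here is a minimal sketch that works directly on an AnalyzeDocument-style response dictionary. It assumes the response contains a "Blocks" list in which tables have "BlockType" equal to "TABLE" and a normalized "Geometry" / "BoundingBox". The function name select_table_blocks, the thresholds, and the hand-written sample_response are made up for this illustration and are not real Textract output.

```python
from typing import Any


def select_table_blocks(
    response: dict[str, Any],
    min_width: float = 0.5,
    min_height: float = 0.5,
) -> list[dict[str, Any]]:
    """Return TABLE blocks whose bounding box exceeds the given page fractions."""
    selected = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "TABLE":
            continue  # skip PAGE, LINE, WORD, CELL, ... blocks
        box = block["Geometry"]["BoundingBox"]
        # Width and Height are ratios of the page size (0.0 to 1.0).
        if box["Width"] > min_width and box["Height"] > min_height:
            selected.append(block)
    return selected


# Hand-written stand-in for a response (hypothetical values).
sample_response = {
    "Blocks": [
        {"Id": "page-1", "BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}}},
        {"Id": "table-1", "BlockType": "TABLE",
         "Geometry": {"BoundingBox": {"Width": 0.3, "Height": 0.1, "Left": 0.6, "Top": 0.05}}},
        {"Id": "table-2", "BlockType": "TABLE",
         "Geometry": {"BoundingBox": {"Width": 0.9, "Height": 0.6, "Left": 0.05, "Top": 0.3}}},
    ]
}

large_tables = select_table_blocks(sample_response)
print([block["Id"] for block in large_tables])
```

The same loop could test "Left" and "Top" instead, for example to keep only tables in the lower half of the page.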

Let's take the first page of your samples as an example. We use the amazon-textract-textractor package to simplify calling the API and parsing the response. Although the image is very blurry, Textract detects two tables there:

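from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Uses the AWS credentials stored under the "default" profile.
extractor = Textractor(profile_name="default")
# Request table detection only.
document = extractor.analyze_document(
    file_source="./stackoverflow.png",
    features=[TextractFeatures.TABLES],
)
# Draw the detected tables on the page image, without word boxes.
document.visualize(with_words=False)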

Now you can filter the tables as needed. For example, here we keep a table only if its width and height are both greater than 50% of the page (bounding box values are fractions of the page size), and then write that table to a .csv file.

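# Keep only tables that span more than half the page in both dimensions.
tables = [t for t in document.tables if t.bbox.width > 0.5 and t.bbox.height > 0.5]
if tables:  # avoid an IndexError when no table passes the filter
    with open('output.csv', 'w') as f:
        f.write(tables[0].to_csv())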
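The question also asks about selecting tables by their text. One possible way to do that, sketched here against the raw response format and not against Textractor objects, is to rebuild each table's rows from its CELL and WORD blocks and keep the tables that contain a keyword. The helper names (child_ids, table_rows, tables_containing, rows_to_csv) and the sample data are hypothetical; the sketch assumes cells carry "RowIndex" and "ColumnIndex" and are linked to their words through "CHILD" relationships, and it ignores merged cells.

```python
import csv
import io
from typing import Any

Block = dict[str, Any]


def child_ids(block: Block) -> list[str]:
    """Collect the ids listed under a block's CHILD relationships."""
    ids: list[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def table_rows(table: Block, by_id: dict[str, Block]) -> list[list[str]]:
    """Rebuild a table as a list of rows from its CELL and WORD blocks."""
    cells: dict[tuple[int, int], str] = {}
    for cell_id in child_ids(table):
        cell = by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue  # merged cells and titles are ignored in this sketch
        words = [by_id[w]["Text"] for w in child_ids(cell)
                 if by_id[w]["BlockType"] == "WORD"]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # RowIndex and ColumnIndex start at 1.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def tables_containing(response: dict[str, Any], keyword: str) -> list[list[list[str]]]:
    """Return the rows of every table with a cell containing the keyword."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    matches = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        rows = table_rows(block, by_id)
        if any(keyword.lower() in cell.lower() for row in rows for cell in row):
            matches.append(rows)
    return matches


def rows_to_csv(rows: list[list[str]]) -> str:
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def _word(i: str, text: str) -> Block:
    return {"Id": i, "BlockType": "WORD", "Text": text}


def _cell(i: str, row: int, col: int, words: list[str]) -> Block:
    return {"Id": i, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": words}]}


# Hand-written stand-in for a response with two small tables (hypothetical).
sample = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2", "w3"]),
    _word("w1", "Bank"), _word("w2", "Account"), _word("w3", "No."),
    {"Id": "t2", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c3", "c4", "c5", "c6"]}]},
    _cell("c3", 1, 1, ["w4"]), _cell("c4", 1, 2, ["w5"]),
    _cell("c5", 2, 1, ["w6"]), _cell("c6", 2, 2, ["w7"]),
    _word("w4", "Item"), _word("w5", "Qty"), _word("w6", "Widget"), _word("w7", "2"),
]}

invoice_tables = tables_containing(sample, "qty")
print(rows_to_csv(invoice_tables[0]))
```

A text condition like this can be combined with the bounding box condition, for instance by keeping only tables that are both large and contain an expected column header.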
