How to extract only certain tables from a PDF (invoice) that contains multiple tables in a structured format

How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but the problem is that it gives me all the tables in the PDF as CSV. I need to extract only certain tables based on conditions such as the text they contain or their bounding box dimensions.

Apart from the paid tool, I have also tried several other libraries:

  1. PyPDF2
  2. Textract
  3. Tika
  4. pdfplumber
  5. pdfminer
  6. pdftotext
  7. PyMuPDF (bounding box technique)
  8. Tabula

The problem arises when I have multiple PDFs: some open-source libraries can read the text of a PDF but do not return it in a structured format. Sometimes they cannot read the text at all because the PDFs are scanned images.

So I decided to use Amazon Textract. Let me know if you have any other recommendations for libraries or paid tools that work better than Amazon Textract.

There is 1 solution below

The .csv files that you get from Amazon Textract are a post-processed version of the raw API output. You can use the API output to select what you need based on some criteria that you define.
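As a rough sketch of what selecting from the raw output can look like: the AnalyzeDocument response is a dictionary with a `Blocks` list, where each detected table is a block with `BlockType` equal to `TABLE` and a `Geometry.BoundingBox` given as fractions of the page. The helper name `select_table_blocks`, the thresholds, and the hand-written sample response are assumptions made for illustration only.

```python
def select_table_blocks(response, min_width=0.5, min_height=0.5):
    """Return TABLE blocks whose bounding box exceeds the given page fractions."""
    selected = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "TABLE":
            continue
        box = block["Geometry"]["BoundingBox"]
        if box["Width"] > min_width and box["Height"] > min_height:
            selected.append(block)
    return selected


# Hand-written stand-in for a response, only to show the shape of the data.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1",
         "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}}},
        {"BlockType": "TABLE", "Id": "t1",
         "Geometry": {"BoundingBox": {"Width": 0.8, "Height": 0.6, "Left": 0.1, "Top": 0.3}}},
        {"BlockType": "TABLE", "Id": "t2",
         "Geometry": {"BoundingBox": {"Width": 0.3, "Height": 0.1, "Left": 0.6, "Top": 0.05}}},
    ]
}

large_tables = select_table_blocks(sample_response)
print([block["Id"] for block in large_tables])
```

Working on the raw blocks means following the `Relationships` of each table to rebuild its cells yourself, which is the part a parsing package takes care of.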

Let's take the first page of your samples as an example. We use the amazon-textract-textractor package to simplify calling the API and parsing the response. Even though the image is very blurry, Textract detects two tables there:

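from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Use the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")
# Ask Textract for table detection only
document = extractor.analyze_document(
    file_source="./stackoverflow.png",
    features=[TextractFeatures.TABLES],
)
# Display the page with the detected tables outlined
document.visualize(with_words=False)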

[Image: visualization of the document with two detected tables]

Now you can filter the tables as needed. For example, here we keep a table only if its width and height are both greater than 50% of the page (the bounding box values are relative to the page size, so 0.5 means 50%), and then write that table to a .csv file.

tables = [t for t in document.tables if t.bbox.width > 0.5 and t.bbox.height > 0.5]
if tables:  # avoid an IndexError when no table passes the filter
    with open('output.csv', 'w') as f:
        f.write(tables[0].to_csv())
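Since the question also mentions selecting by text, the same idea can be extended to combine a size threshold with a keyword check. This is only a sketch: `pick_tables` and `fake_table` are hypothetical names, the keyword search simply looks inside the string returned by `to_csv()`, and the stand-in objects mimic only the attributes used here so that the example runs without calling AWS.

```python
from types import SimpleNamespace


def pick_tables(tables, keyword=None, min_width=0.0, min_height=0.0):
    """Keep tables that are large enough and, optionally, contain a keyword."""
    picked = []
    for table in tables:
        if table.bbox.width <= min_width or table.bbox.height <= min_height:
            continue
        # The CSV export doubles as a plain-text view of the table contents.
        if keyword is not None and keyword.lower() not in table.to_csv().lower():
            continue
        picked.append(table)
    return picked


def fake_table(width, height, csv_text):
    """Stand-in exposing only bbox.width, bbox.height and to_csv() (demo assumption)."""
    return SimpleNamespace(
        bbox=SimpleNamespace(width=width, height=height),
        to_csv=lambda: csv_text,
    )


demo_tables = [
    fake_table(0.9, 0.2, "Invoice No,Date\n123,2023-01-01\n"),
    fake_table(0.9, 0.6, "Description,Qty,Amount\nWidget,2,10.00\n"),
]

# Keep only wide tables that mention "Description" somewhere in their cells.
line_items = pick_tables(demo_tables, keyword="Description", min_width=0.5)
print(len(line_items))
```

With a real document you would pass `document.tables` instead of `demo_tables`. A header keyword tends to be more robust than position when the invoice layout shifts between files.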