Identifying tables with gridlines in a PDF using Python with tabula


I'm trying to extract all the tables contained in a PDF document (about 250 pages). Extraction itself is not the problem; identifying the tables is. My current approach also picks up junk such as contents listings and, sometimes, bullet points, which I don't want. I specifically want only tables with grid lines.

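from PyPDF2 import PdfFileReader
from tabula import read_pdf

pages_required = []
# Count the pages, closing the file handle when done.
with open("input.pdf", mode="rb") as f:
    n = PdfFileReader(f).getNumPages()

# tabula page numbers are 1-based.
for page in range(1, n + 1):
    df = read_pdf("input.pdf", pages=page)
    # Older tabula-py returns None when nothing is found; newer
    # versions return a (possibly empty) list of DataFrames.
    if df is not None and len(df) > 0:
        pages_required.append(page)
print(pages_required)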

This filters the pages to an extent, but not completely. I need an array of only those page numbers that have tables with grid lines. Is there a way to do this?
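One direction that might help, offered as a rough sketch rather than a confirmed fix: tabula-py's `read_pdf` accepts a `lattice=True` option, which asks tabula to extract using ruling lines instead of whitespace between columns. The helper names below (`read_lattice_tables`, `pages_with_ruled_tables`) and the `min_rows` / `min_cols` thresholds are hypothetical, chosen only for illustration, and lattice mode is not guaranteed to reject every non-table region.

```python
def read_lattice_tables(pdf_path, page):
    """Return the tables tabula-py finds on one page in lattice mode."""
    # Imported lazily so the filtering logic below can be exercised
    # without tabula-py (and Java) being installed.
    from tabula import read_pdf
    # lattice=True: rely on ruling lines rather than column whitespace.
    # multiple_tables=True: get a list of DataFrames back.
    return read_pdf(pdf_path, pages=page, lattice=True, multiple_tables=True)


def pages_with_ruled_tables(pdf_path, n_pages, read_page=read_lattice_tables,
                            min_rows=1, min_cols=2):
    """Collect 1-based page numbers that have at least one table of
    at least min_rows x min_cols (hypothetical thresholds to drop
    single-column junk such as bullet lists)."""
    pages = []
    for page in range(1, n_pages + 1):
        tables = read_page(pdf_path, page)
        if tables is None:  # older tabula-py may return None
            tables = []
        if any(t.shape[0] >= min_rows and t.shape[1] >= min_cols
               for t in tables):
            pages.append(page)
    return pages


# Example usage (assumes input.pdf exists and n is the page count):
# print(pages_with_ruled_tables("input.pdf", n))
```

Passing the reader in as `read_page` is only a convenience so the size filter can be tried on its own; the thresholds would likely need tuning against the actual document.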
