I'm trying to extract all the tables contained in a PDF document (about 250 pages). Extraction itself is not the problem; identifying the tables is. My algorithm also picks up junk data such as the contents and sometimes bullet points, which I don't want. I specifically want only tables with grid lines. Here is my current code:
from PyPDF2 import PdfFileReader
from tabula import read_pdf

pages_required = []
# Count the pages, closing the file once the count is known
with open("input.pdf", mode="rb") as f:
    n = PdfFileReader(f).getNumPages()

# Try each page (tabula page numbers are 1-based) and keep those where tabula returns a result
for page in [str(i + 1) for i in range(n)]:
    df = read_pdf("input.pdf", pages=page)
    if df is not None:
        pages_required.append(page)
print(pages_required)
This filters the pages to some extent, but not completely. I need an array of only those page numbers that have tables with grid lines. Is there a way to do this?
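One direction that might be worth trying, as a rough sketch rather than a confirmed fix: tabula-py's `read_pdf` accepts a `lattice=True` option, which asks tabula to locate cells from ruling lines instead of from whitespace. The helper names `has_table` and `pages_with_ruled_tables` and the `read_page` parameter below are hypothetical, introduced only for illustration. The sketch also assumes that `read_pdf` may return `None`, a single DataFrame, or a list of DataFrames depending on the tabula-py version, and it treats empty results as "no table". Boxed text or other ruled regions may still be reported, so the output probably needs spot-checking.

```python
def has_table(result):
    """Return True if a tabula result holds at least one non-empty table."""
    if result is None:  # some tabula-py versions return None when nothing is found
        return False
    # Normalise a single DataFrame to a one-element list
    tables = result if isinstance(result, list) else [result]
    return any(t is not None and not t.empty for t in tables)


def pages_with_ruled_tables(pdf_path, n_pages, read_page=None):
    """Return the 1-based page numbers on which read_page finds a table.

    read_page is a hypothetical hook (path, page) -> result so the page
    reader can be swapped out; by default it calls tabula in lattice mode.
    """
    if read_page is None:
        # Imported lazily because tabula-py needs a Java runtime
        from tabula import read_pdf

        def read_page(path, page):
            # lattice=True: detect cells from ruling lines (assumed to match "grid lines")
            return read_pdf(path, pages=page, lattice=True, multiple_tables=True)

    return [p for p in range(1, n_pages + 1) if has_table(read_page(pdf_path, p))]
```

With the page count `n` from the code above, `pages_with_ruled_tables("input.pdf", n)` would give the candidate page numbers as integers.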