How can I search for text in a PDF and get the table area using Camelot?


I am using Camelot to extract table data from PDFs. Camelot works pretty well, but I have a page with several tables and I need just one of them. I want to find that one based on a regex search.

If I specify the table area, Camelot finds the table. (If I don't pass that parameter, it treats the whole page as one table.)

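import camelot

# table_areas entries are "x1,y1,x2,y2": top-left and bottom-right corners in PDF coordinates
table = camelot.read_pdf(file, flavor="stream", pages='5', table_areas=['20,530,550,350'], row_tol=15)

camelot.plot(table[0], kind='contour')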

In the resulting contour plot, the blue boxes are text. I only care about the table inside the red box.


My question: given that I know the text I'm searching for, how can I locate it and get the approximate table area to pass along to Camelot? I already have working code for the regex search (using PyMuPDF).
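One possible direction, sketched here under several assumptions: PyMuPDF reports text block rectangles with the origin at the top-left of the page, while Camelot's `table_areas` expects "x1,y1,x2,y2" with the origin at the bottom-left, so the y values have to be flipped against the page height. The helper names (`mupdf_rect_to_camelot_area`, `find_area_for_pattern`) and the padding parameters are hypothetical, and the padding needed to grow the matched text box into the full table is a per-document guess. The sketch also assumes an unrotated page whose origin is at (0, 0).

```python
import re


def mupdf_rect_to_camelot_area(rect, page_height, pad_x=0.0, pad_top=0.0, pad_bottom=0.0):
    """Convert a PyMuPDF-style rect to a Camelot table_areas string.

    rect is (x0, y0, x1, y1) with the origin at the top-left (y grows downward).
    The result is "left,top,right,bottom" with the origin at the bottom-left.
    The padding values (hypothetical parameters) enlarge the box, in points.
    """
    x0, y0, x1, y1 = rect
    left = max(x0 - pad_x, 0)
    right = x1 + pad_x
    # Flip the y axis: a small y in PyMuPDF is near the top of the page.
    top = page_height - max(y0 - pad_top, 0)
    bottom = max(page_height - (y1 + pad_bottom), 0)
    return f"{left:.0f},{top:.0f},{right:.0f},{bottom:.0f}"


def find_area_for_pattern(pdf_path, page_number, pattern, **padding):
    """Return a table_areas string for the first text block matching pattern.

    page_number is 1-based to match Camelot's pages argument.
    Returns None if no block on the page matches.
    """
    import fitz  # PyMuPDF; imported lazily so the converter works without it

    regex = re.compile(pattern)
    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]
        # Each block is (x0, y0, x1, y1, text, block_no, block_type).
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, text = block[:5]
            if isinstance(text, str) and regex.search(text):
                return mupdf_rect_to_camelot_area(
                    (x0, y0, x1, y1), page.rect.height, **padding
                )
    return None
```

The returned string could then be passed as `table_areas=[area]` to `camelot.read_pdf`, for example after calling `find_area_for_pattern(file, 5, r"some heading", pad_x=20, pad_bottom=180)`, where the pattern and padding values are placeholders to tune against the contour plot.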

Since Camelot returns the text, I assume there is a way to get the bounding-box coordinates, but I can't find it in their documentation, which is here:
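For what it's worth, a rough sketch of reading coordinates back from Camelot itself: it assumes the returned `Table` object exposes `cells` as a list of rows of cell objects carrying `x1`, `y1`, `x2`, `y2` (bottom-left and top-right corners in PDF coordinates) and `text`, which matches my reading of the lower-level classes but should be verified against the installed version. The function name `area_from_matching_row` and the `n_rows` parameter are hypothetical.

```python
import re


def area_from_matching_row(table, pattern, n_rows=1):
    """Build a table_areas string from the first row whose text matches pattern.

    table is assumed to look like a camelot Table: table.cells is a list of
    rows, each a list of cells with x1, y1 (bottom-left), x2, y2 (top-right)
    and text attributes. n_rows is how many rows, starting at the matching
    row, the returned area should cover.
    """
    regex = re.compile(pattern)
    for r, row in enumerate(table.cells):
        if any(regex.search(cell.text or "") for cell in row):
            # Collect every cell in the matched row and the rows that follow it.
            cells = [c for span_row in table.cells[r:r + n_rows] for c in span_row]
            left = min(c.x1 for c in cells)
            right = max(c.x2 for c in cells)
            bottom = min(c.y1 for c in cells)
            top = max(c.y2 for c in cells)
            # Camelot expects "x1,y1,x2,y2" as top-left then bottom-right.
            return f"{left:.0f},{top:.0f},{right:.0f},{bottom:.0f}"
    return None
```

Under those assumptions, a first pass over the whole page (the case where Camelot treats the page as one table) could be searched with `area_from_matching_row(tables[0], r"some heading", n_rows=10)`, and the result fed to a second `camelot.read_pdf` call through `table_areas`.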

https://camelot-py.readthedocs.io/en/master/api.html#lower-level-classes

I'm sure there is an OpenCV solution, but I wanted to do it with Camelot first if possible. I'd appreciate any help. Thank you.
