I am using Camelot to extract table data from PDFs. Camelot works pretty well, but I have a page with several tables and I need just one, which I want to locate with a regex search.
If I run the code below with the table area specified, Camelot finds the table. (If I don't specify any parameters, it treats the whole page as one table.)
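import camelot
# Each table_areas entry is "x1,y1,x2,y2": top-left and bottom-right corners in PDF coordinates (origin at bottom-left)
table = camelot.read_pdf(file, flavor="stream", pages='5', table_areas=['20, 530, 550, 350'], row_tol=15)
camelot.plot(table[0], kind='contour')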
In the resulting plot, the blue boxes are text. I only care about the table in the red box.
My question: given that I know the text I'm searching for, how can I search and get the approximate table area, which I then pass along to Camelot? I already have working code to search for regex (PyMuPDF).
Since Camelot returns the text, I assume there is a way to get the bounding-box coordinates, but I can't find it in their documentation, which is here:
https://camelot-py.readthedocs.io/en/master/api.html#lower-level-classes
I'm sure there is an OpenCV solution, but I wanted to do it with Camelot first if possible. I'd appreciate any help. Thank you.
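To make the goal concrete, here is a rough sketch of the kind of thing I have in mind, not a confirmed solution. It assumes the regex matches a single word near the top of the table (for example a heading), that the page is not rotated, and that the table extends a fixed distance below the match. The helper names (`boxes_matching`, `table_area_from_match`, `find_table`) and the `table_height`, `margin`, and `pad` values are my own placeholders. PyMuPDF reports coordinates with the origin at the top-left, while Camelot's `table_areas` expects the origin at the bottom-left, so the y values have to be flipped.

```python
import re


def boxes_matching(words, pattern):
    """Return (x0, y0, x1, y1) for each word whose text matches the regex.

    `words` is expected in the shape returned by PyMuPDF's
    page.get_text("words"): (x0, y0, x1, y1, text, block, line, word_no).
    """
    rx = re.compile(pattern)
    return [tuple(w[:4]) for w in words if rx.search(w[4])]


def table_area_from_match(box, page_width, page_height,
                          table_height=180, margin=20, pad=5):
    """Build a Camelot table_areas string from a PyMuPDF word box.

    Assumption: the table starts at the matched word and extends
    `table_height` points below it, spanning the page minus `margin`.
    """
    _, y0, _, _ = box                      # y0 = top of the match, measured from page top
    top = min(page_height, page_height - y0 + pad)   # flip to bottom-left origin
    bottom = max(0, top - table_height)
    left, right = margin, page_width - margin
    return f"{left:.0f},{top:.0f},{right:.0f},{bottom:.0f}"


def find_table(pdf_path, page_number, pattern, **area_kwargs):
    """Locate the first regex match on a page and run Camelot on that area."""
    # Imported here so the pure helpers above work without these packages.
    import fitz  # PyMuPDF
    import camelot

    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]        # PyMuPDF pages are 0-indexed
        words = page.get_text("words")
        width, height = page.rect.width, page.rect.height

    boxes = boxes_matching(words, pattern)
    if not boxes:
        return None
    area = table_area_from_match(boxes[0], width, height, **area_kwargs)
    return camelot.read_pdf(pdf_path, flavor="stream", pages=str(page_number),
                            table_areas=[area], row_tol=15)
```

A call would look something like `find_table(file, 5, r"^Revenue", table_height=180)`, where the pattern is just an example. The fixed `table_height` is the weak point: a better version would stop at the next heading or at a large vertical gap between words, which is the part I don't know how to get from Camelot itself.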