Python: extract the text between two tables (the title above each table, outside the table) from a PDF with tabula


I am trying to extract tables from a PDF file. After trying several packages, tabula is the one that extracts the tables from my PDF correctly. The problem is that each table has a title above it that is not part of the table itself.

import tabula.io as tb

file_path = ""  # path to the PDF file
tables = tb.read_pdf(file_path, pages="1")  # list of DataFrames, one per detected table

I would like to extract the title of each table as well. I tried other packages, but they also return text from inside the tables, and I cannot tell whether a given piece of text is inside or outside a table.

*I have tried camelot as well. I know it can extract text from the whole page, but it messes up my table format.

Is there a way to extract only the text outside the tables, or to extract each table together with its title?

Thanks!

Reference table image from: https://pspdfkit.com/guides/ios/customizing-the-interface/changing-the-document-title/


There is 1 solution below


Camelot exposes the page dimensions of the PDF through its utils.get_page_layout function:

import camelot
# file_path is the path to the PDF; dim is (width, height) of the page
layout, dim = camelot.utils.get_page_layout(file_path)

These dimensions help locate the area where the table title is likely to be:

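# table is one of the tables returned by camelot.read_pdf(...);
# table._bbox is (x1, y1, x2, y2) with the origin at the bottom-left of the page
box_for_table_name = (
    table._bbox[0],                # left edge of the table
    dim[1] - table._bbox[3] - 35,  # 35 units above the table's top edge
    table._bbox[2],                # right edge of the table
    dim[1] - table._bbox[3] + 2,   # 2 units below the table's top edge
)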

This calculation converts Camelot's PDF coordinates (origin at the bottom-left of the page) into the top-left-origin coordinates that fitz uses for its rectangles.

I am not sure the offsets (35 and 2) fit your case, but you can adjust them according to the font size of the title and the gap between the title and the table.
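To make the offsets easier to tune, the same arithmetic can be wrapped in a small helper. This is only a sketch: the function name title_box_above and its parameters title_height and overlap are made up for illustration, and the sample numbers are invented rather than taken from a real PDF.

```python
def title_box_above(table_bbox, page_height, title_height=35, overlap=2):
    """Return (x0, y0, x1, y1), measured from the top-left of the page,
    for the strip directly above a table.

    table_bbox is (x1, y1, x2, y2) measured from the bottom-left of the
    page, which is the layout assumed here for Camelot's table._bbox.
    """
    x_left, _y_bottom, x_right, y_top = table_bbox
    # Distance of the table's top edge from the top of the page.
    top_edge = page_height - y_top
    return (x_left, top_edge - title_height, x_right, top_edge + overlap)


# Invented example: a page 842 units tall, table top edge 700 units
# above the bottom of the page.
box = title_box_above((72, 400, 523, 700), page_height=842)
print(box)  # (72, 107, 523, 144)
```

A larger title_height catches multi-line titles, at the risk of also picking up the last lines of whatever sits above the title.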

You can then extract the title with fitz (PyMuPDF):

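import fitz
doc = fitz.open(file_path)
page = doc[0]  # first page, matching pages="1" in the question
clip = fitz.Rect(box_for_table_name).round()
# get_text with a clip rectangle returns only the text inside that rectangle
title = page.get_text("text", clip=clip).strip()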
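If a fixed strip above each table is too rigid, another way to get at the "text only outside the table" part of the question is to filter the words on the page by position. The sketch below is an assumption-laden illustration, not part of the approach above: the helper name words_outside_tables is hypothetical, the word tuples are assumed to have the shape returned by PyMuPDF's page.get_text("words"), that is (x0, y0, x1, y1, word, block_no, line_no, word_no), and the table rectangles are assumed to be already converted to the same top-left-origin coordinates. The sample data is invented.

```python
def words_outside_tables(words, table_rects):
    """Keep the words whose centre point lies outside every table rectangle."""

    def inside(word, rect):
        x0, y0, x1, y1 = word[:4]
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        rx0, ry0, rx1, ry1 = rect
        return rx0 <= cx <= rx1 and ry0 <= cy <= ry1

    return [w for w in words if not any(inside(w, r) for r in table_rects)]


# Invented sample data: two title words above a table and one word inside it.
words = [
    (72, 110, 120, 122, "Table", 0, 0, 0),
    (124, 110, 140, 122, "1", 0, 0, 1),
    (80, 150, 130, 162, "Name", 1, 0, 0),
]
tables = [(72, 142, 523, 442)]  # (x0, y0, x1, y1), origin at top-left

outside = words_outside_tables(words, tables)
print(" ".join(w[4] for w in outside))  # Table 1
```

Testing the centre point rather than the whole word box keeps words that merely touch a table border from being dropped.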