How to extract text from PDFs with complex layouts using Python?

I am extracting text from PDFs, but it is hard to handle complex layouts such as two-column pages and the different ways tables can appear: tables with borders, tables without borders, and combined cases such as a table inside a two-column layout or two adjacent tables. It gets even harder when all of these layouts occur in a single PDF. Is there a way to extract text from such PDFs without losing its structure?

I tried using PyMuPDF to get each page's layout dictionary with its coordinates (bbox), but I couldn't differentiate between the different layouts in the PDF.
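To make the bbox-based attempt concrete, here is a minimal sketch of one way to recover reading order from block coordinates. It is not a general solution: it assumes at most two equal-width columns split at the page midline, and it treats any block that crosses the midline (a title or a wide table, for example) as a full-width element that separates the page into vertical bands. The names `reading_order` and `page_blocks` are hypothetical helpers made up for this example; the only PyMuPDF calls used are `fitz.open`, `page.get_text("blocks")` and `page.rect.width`.

```python
from typing import List, Sequence, Tuple

# A block is (x0, y0, x1, y1, text, ...), the tuple shape returned by
# PyMuPDF's page.get_text("blocks"); only the first five fields are used.
Block = Sequence


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Order block texts for a page with at most two columns.

    Assumption (not from the question): columns are split at the page
    midline, and any block crossing the midline spans the full width.
    """
    mid = page_width / 2
    spanning = sorted((b for b in blocks if b[0] < mid < b[2]),
                      key=lambda b: b[1])
    columns = [b for b in blocks if not (b[0] < mid < b[2])]

    ordered: List[Block] = []
    band_top = float("-inf")
    # Each full-width block closes a band; the final None closes the last one.
    for boundary in spanning + [None]:
        band_bottom = boundary[1] if boundary is not None else float("inf")
        in_band = [b for b in columns if band_top <= b[1] < band_bottom]
        left = sorted((b for b in in_band if b[2] <= mid), key=lambda b: b[1])
        right = sorted((b for b in in_band if b[2] > mid), key=lambda b: b[1])
        ordered.extend(left + right)  # left column first, then right column
        if boundary is not None:
            ordered.append(boundary)
            band_top = boundary[1]
    return [b[4].strip() for b in ordered]


def page_blocks(pdf_path: str, page_number: int = 0) -> Tuple[list, float]:
    """Hypothetical helper: load text blocks and page width with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so reading_order works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # Field 6 is the block type; 0 means a text block.
        blocks = [b for b in page.get_text("blocks") if b[6] == 0]
        return blocks, page.rect.width
```

With a real file this would be called as `reading_order(*page_blocks("example.pdf"))`, where `example.pdf` is a placeholder path. The sketch does nothing to detect borderless tables or adjacent tables; recent PyMuPDF releases also provide `page.find_tables()`, which may be worth trying for those cases.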
