Let's say I have a couple hundred PDF files from which I need to extract each heading together with its relevant text, for further processing per heading. How do I do that while keeping the format of the file? I have tried the PyPDF2 and pdfminer libraries. Both are good at extracting text, but I need to get the headings and the text out separately. One option might be to convert each file to XML; would that expose the headings?
As mentioned, I have tried PyPDF2 and pdfminer. They extract all the text in my case, but I still cannot pull out the headings, which I need in order to build some context.
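
To make the goal concrete, here is a minimal sketch of the kind of output I am after. It assumes the maintained pdfminer.six fork (its `extract_pages` function and the `LTTextContainer`, `LTTextLine`, and `LTChar` layout classes) and assumes that headings are simply lines set in a noticeably larger font than the body text. That font-size rule is only a guess and will not hold for every PDF. The function names `iter_lines_with_size` and `split_by_heading` and the `ratio` parameter are hypothetical, made up for illustration.

```python
from collections import Counter


def iter_lines_with_size(pdf_path):
    """Yield (text, font_size) for each text line, using pdfminer.six."""
    # Imported lazily so the grouping logic below works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # LTChar carries the font size; LTAnno (spaces, newlines) does not.
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if text and sizes:
                    yield text, round(max(sizes), 1)


def split_by_heading(lines, ratio=1.15):
    """Group (text, size) pairs into (heading, body_text) sections.

    Assumption: the body uses the most common font size, and any line
    at least `ratio` times larger than that is a heading.
    """
    lines = list(lines)
    if not lines:
        return []
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    sections = []
    heading, body = None, []
    for text, size in lines:
        if size >= body_size * ratio:
            # A new heading closes the section collected so far.
            if heading is not None or body:
                sections.append((heading, " ".join(body)))
            heading, body = text, []
        else:
            body.append(text)
    sections.append((heading, " ".join(body)))
    return sections
```

For a single file this would be called as `split_by_heading(iter_lines_with_size("some.pdf"))`, where `some.pdf` is a placeholder path. Text that appears before the first heading ends up in a section whose heading is `None`.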