I want to know the location of every word on a PDF page. I have been searching the web but couldn't find anything. Can anyone suggest which library (preferably for the Java platform) I should use?
Getting text location from pdf
466 Views Asked by Prabhjot Rai At
There are 2 best solutions below

Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure
Basically, with PDFBox, you can access the page's content stream with
InputStream is = yourPDFDocument.getDocumentCatalog().getPages().get(yourPage).getContents();
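and then search it for the X Y Td line you're looking for. Note that Td moves to the start of the next line, offset by (X, Y) from the start of the current line, so the values are relative; positions can also be set with other operators such as Tm.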
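To illustrate the search step, here is a minimal sketch that scans an already-decoded content stream (as a String) for Td operators using only the Java standard library. The class name TdScanner and the method findTdOffsets are made up for this example and are not part of PDFBox. It assumes the operands are plain numbers, and it does not skip string literals, handle Tm/TD, or convert the relative offsets to absolute page coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): finds "x y Td" operators in content stream text.
class TdScanner {
    // A PDF number: optional sign, then digits with an optional fraction (e.g. 72, -14.5, .5).
    private static final String NUM = "[+-]?(?:\\d+\\.?\\d*|\\.\\d+)";

    // Two numeric operands followed by the Td operator.
    private static final Pattern TD =
        Pattern.compile("(?<![\\w.])(" + NUM + ")\\s+(" + NUM + ")\\s+Td\\b");

    // Returns one {x, y} pair per Td operator, in the order they appear.
    // These are offsets relative to the start of the current line, not absolute positions.
    static List<double[]> findTdOffsets(String contentStream) {
        List<double[]> offsets = new ArrayList<>();
        Matcher m = TD.matcher(contentStream);
        while (m.find()) {
            offsets.add(new double[] {
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2))
            });
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up content stream fragment for demonstration.
        String sample = "BT\n/F1 12 Tf\n72 712 Td\n(Hello) Tj\n0 -14.5 Td\n(World) Tj\nET";
        for (double[] o : findTdOffsets(sample)) {
            System.out.println("Td offset: x=" + o[0] + ", y=" + o[1]);
        }
    }
}
```

In practice you would read the InputStream from getContents() into a String first and pass it to findTdOffsets.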
I'm REALLY sure there is a simpler way to do it, but since I work a lot with the content stream for a project, this is the only way I am aware of.
Search PDFBox's Javadocs for more details!
I hope this will help you :)
You can use Textricator, but unfortunately the documentation is not maintained so it's very difficult to make the more interesting aspects of it work. However, to just see the text locations you can use simple text mode.