Getting text location from pdf


I want to find the location (coordinates) of every word on a PDF page. I have searched the web but couldn't find anything. Can anyone tell me which library (preferably for the Java platform) I should use?


There are 2 best solutions below

0
On

You can use Textricator. Unfortunately its documentation is not maintained, so the more interesting features are hard to get working. However, if you just want to see the text locations, the simple text mode is enough:

./textricator.bat text --pages=2 xxx.pdf

# The output is a long CSV listing for the document, including the extracted text and its x,y coordinates.
1
On

Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure

Basically, with PDFBox, you can access the page's content stream with

InputStream is = yourPDFDocument.getDocumentCatalog().getPages().get(yourPage).getContents();

and then search the stream for the "X Y Td" line you're looking for (Td is the operator that moves the text position by the given x and y offsets).
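The search described above can be sketched in plain Java. This is only a rough illustration, not PDFBox API: TdScanner, TdMove and findTdMoves are hypothetical names, and the sketch assumes you have already read the content stream into a String. It simply looks for two numbers followed by Td, so it ignores other positioning operators (Tm, TD, T*), can be fooled by "Td" appearing inside a text string, and reports the raw Td operands, which are offsets from the start of the current line rather than absolute page coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): finds "tx ty Td" operators in content-stream text. */
final class TdScanner {

    /** Operands of one Td operator: offsets from the start of the current line, not page coordinates. */
    static final class TdMove {
        final double tx;
        final double ty;

        TdMove(double tx, double ty) {
            this.tx = tx;
            this.ty = ty;
        }

        @Override
        public String toString() {
            return "Td(" + tx + ", " + ty + ")";
        }
    }

    // A PDF number: optional sign, digits with an optional decimal point (PDF has no exponent form).
    private static final String NUM = "[-+]?(?:\\d+\\.?\\d*|\\.\\d+)";

    // Two numbers followed by the Td operator; the lookbehind avoids starting in the middle of a token.
    private static final Pattern TD =
            Pattern.compile("(?<![\\w.+-])(" + NUM + ")\\s+(" + NUM + ")\\s+Td\\b");

    /** Returns the operands of every "tx ty Td" occurrence, in the order they appear. */
    static List<TdMove> findTdMoves(String contentStream) {
        List<TdMove> moves = new ArrayList<>();
        Matcher m = TD.matcher(contentStream);
        while (m.find()) {
            moves.add(new TdMove(Double.parseDouble(m.group(1)), Double.parseDouble(m.group(2))));
        }
        return moves;
    }

    public static void main(String[] args) {
        // Made-up sample stream for illustration; a real one would come from the page's content stream.
        String sample = "BT\n/F1 12 Tf\n72 700 Td\n(Hello) Tj\n0 -14.5 Td\n(World) Tj\nET\n";
        for (TdMove move : findTdMoves(sample)) {
            System.out.println(move);
        }
    }
}
```

To try it on a real page you would first turn the InputStream from the snippet above into a String, for example by reading all of its bytes and decoding them as ISO-8859-1; that step is left out here.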

I'm really sure there is a simpler way to do this, but since I work a lot with content streams for a project, this is the only way I know.
Search PDFBox's Javadoc for more details!

I hope this will help you :)