Getting text location from pdf

455 Views Asked by Prabhjot Rai At 28 July 2025 at 08:53

I want to find the location of every word on a PDF page. I have searched the web but could not find anything. Can anyone suggest a library (preferably for the Java platform) that I could use?


There are 2 best solutions below

not2qubit On 25 April 2021 at 13:09

You can use Textricator, but unfortunately its documentation is not maintained, so it is difficult to get its more interesting features working. However, if you only need to see the text locations, you can use its simple text mode:

./textricator.bat text --pages=2 xxx.pdf

# output is a long list of CSV properties for the document, including the OCR read text and the x,y coordinates of it.

Nefrasky On 09 December 2015 at 11:25

Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure

Basically, with PDFBox, you can access the content stream of a page with:

InputStream is = yourPDFDocument.getDocumentCatalog().getPages().get(yourPage).getContents();

and then search that stream for the "x y Td" line you're looking for (Td is the operator that moves the text position by x and y).

I'm REALLY sure there is a simpler way to do it, but since I work a lot with content streams for a project, this is the only way I'm aware of.
Search PDFBox's Javadocs for more details!

I hope this will help you :)
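
To make the "search for the x y Td line" step above more concrete, here is a minimal sketch in plain Java. It assumes you have already read the decoded content stream into a String, and it only tracks BT, Td and TD. TdScanner, Point and findTextOrigins are hypothetical names made up for this illustration; they are not part of PDFBox. The sketch ignores Tm, T*, cm, hex strings and comments, so the coordinates match page coordinates only for simple streams that position text purely with Td.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): finds "x y Td" moves in decoded content-stream text. */
public class TdScanner {

    /** Start of a text line, accumulated from Td/TD offsets since the last BT. */
    public static final class Point {
        public final double x;
        public final double y;
        Point(double x, double y) { this.x = x; this.y = y; }
        @Override public String toString() { return "(" + x + ", " + y + ")"; }
    }

    public static List<Point> findTextOrigins(String content) {
        List<Point> origins = new ArrayList<>();
        List<Double> numbers = new ArrayList<>(); // numeric operands seen since the last operator
        double lineX = 0, lineY = 0;              // current line start (assumes no Tm or cm in the stream)
        int i = 0, n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '(') {
                // Skip a literal string so that text such as "(Td)" is not mistaken for an operator.
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\') i++;            // step over the escaped character
                    else if (s == '(') depth++;
                    else if (s == ')') depth--;
                    i++;
                } while (i < n && depth > 0);
                numbers.clear();
                continue;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(content.charAt(i)) && content.charAt(i) != '(') i++;
            String token = content.substring(start, i);
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                numbers.add(Double.parseDouble(token));
                continue;
            }
            if (token.equals("BT")) {
                lineX = 0;                         // BT resets the text line matrix
                lineY = 0;
            } else if ((token.equals("Td") || token.equals("TD")) && numbers.size() >= 2) {
                // Td/TD move relative to the start of the current line.
                lineX += numbers.get(numbers.size() - 2);
                lineY += numbers.get(numbers.size() - 1);
                origins.add(new Point(lineX, lineY));
            }
            numbers.clear();                       // any other token ends the current operand run
        }
        return origins;
    }
}
```

For example, calling TdScanner.findTextOrigins on "BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (World) Tj ET" should give two points, (72.0, 712.0) and (72.0, 698.0), because the second Td is an offset from the first line start. This gives you line starts only, not per-word boxes; for those you would still need the font metrics or a higher-level text extraction API.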
