PDFBox for Persian documents


I want to use PDFBox to extract text from Persian PDF files, but it returns "?" for every Persian character. Latin words in the same document are extracted correctly.

How can I fix it? Any advice?

Answered by Tilman Hausherr:

Sadly, the provided file contains the Persian text as vector graphics, not as text drawn with fonts, so it cannot be extracted. You'll have to use OCR for it.

See also the text extraction FAQ:

How come I am not getting any text from the PDF document?

Text extraction from a PDF document is a complicated task, and many factors affect the possibility and accuracy of text extraction. It would be helpful to the PDFBox team if you could try a couple of things.

Open the PDF in Acrobat and try to extract text from there. If Acrobat can extract text then PDFBox should be able to as well and it is a bug if it cannot. If Acrobat cannot extract text then PDFBox ‘probably’ cannot either.

It might really be an image instead of text. Some PDF documents are just images that have been scanned in. You can tell by using the selection tool in Acrobat: if you can't select any text, then it is probably an image.
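
The same check can be roughly automated on the extracted string. The following is only a minimal sketch, not part of PDFBox: the class `ExtractionCheck` and its method names are hypothetical, and it assumes you have already obtained the text (for example from `PDFTextStripper.getText`) and that the document is known to contain Persian. It counts code points in the Arabic script, which Persian uses, and flags the result as a candidate for OCR when none are found.

```java
/**
 * Hypothetical helper (not a PDFBox class): inspects text that was
 * already extracted from a PDF and reports whether any Persian
 * (Arabic-script) characters survived the extraction.
 */
final class ExtractionCheck {

    private ExtractionCheck() {}

    /** Counts code points whose Unicode script is ARABIC (used by Persian). */
    static long countArabicScript(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC)
                .count();
    }

    /** Counts literal '?' characters, the symptom described in the question. */
    static long countQuestionMarks(String text) {
        return text.chars().filter(c -> c == '?').count();
    }

    /**
     * Assumption: the source document is known to contain Persian.
     * If no Arabic-script code point was extracted, the Persian part is
     * probably not stored as text, and OCR is the likely fallback.
     */
    static boolean lacksPersianText(String text) {
        return countArabicScript(text) == 0;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the text extraction step.
        String extracted = args.length > 0 ? args[0] : "???? PDF";
        System.out.println("Arabic-script code points: " + countArabicScript(extracted));
        System.out.println("Question marks: " + countQuestionMarks(extracted));
        System.out.println("Probably needs OCR: " + lacksPersianText(extracted));
    }
}
```

This only inspects the string. It cannot tell why the characters are missing, so the manual selection test in Acrobat described above is still the more reliable diagnosis.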