PDFBOX for Persian document

255 Views Asked by Azadeh Fakhrzadeh At 29 August 2018 at 06:17

I want to use PDFBox to extract text from Persian PDF files, but it returns "?" for all the Persian characters (Latin words in the same document are returned correctly).

How can I fix it? Any advice?


There is 1 best solution below

Tilman Hausherr On 01 September 2018 at 08:34

Sadly, the provided file has the Persian text as vector graphics, not as text from fonts, so it cannot be extracted. You'll have to use OCR for it.

See also the text extraction FAQ:

How come I am not getting any text from the PDF document?

Text extraction from a PDF document is a complicated task, and many factors affect the possibility and accuracy of text extraction. It would be helpful to the PDFBox team if you could try a couple of things.

Open the PDF in Acrobat and try to extract text from there. If Acrobat can extract text then PDFBox should be able to as well and it is a bug if it cannot. If Acrobat cannot extract text then PDFBox ‘probably’ cannot either.

It might really be an image instead of text. Some PDF documents are just images that have been scanned in. You can tell by using the selection tool in Acrobat: if you can’t select any text, then it is probably an image.
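To complement the manual Acrobat check, here is a minimal sketch of a programmatic sanity check on the string you get back from extraction (for example from PDFBox's `PDFTextStripper.getText`). It uses only the Java standard library and counts how many characters belong to the Arabic script, which is the Unicode script that covers Persian letters. The class name `ExtractionCheck` and its method names are hypothetical, made up for this illustration, and the "question marks but no Arabic-script characters" rule is just a heuristic assumption, not something PDFBox provides.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of PDFBox: inspects already-extracted text.
final class ExtractionCheck {

    // Counts code points whose Unicode script is ARABIC (this includes Persian letters).
    static int countArabicScript(String text) {
        return (int) text.codePoints()
                .filter(cp -> UnicodeScript.of(cp) == UnicodeScript.ARABIC)
                .count();
    }

    // Counts literal ASCII question marks in the text.
    static int countQuestionMarks(String text) {
        return (int) text.chars().filter(c -> c == '?').count();
    }

    // Heuristic (an assumption): no Arabic-script characters at all, but at least
    // one '?', suggests the Persian text was not extracted as real text.
    static boolean looksLikeFailedExtraction(String text) {
        return countArabicScript(text) == 0 && countQuestionMarks(text) > 0;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument, or fall back to a sample.
        String sample = args.length > 0 ? args[0] : "PDFBox ???? ?????";
        System.out.println("Arabic-script code points: " + countArabicScript(sample));
        System.out.println("Question marks: " + countQuestionMarks(sample));
        System.out.println("Looks like failed extraction: " + looksLikeFailedExtraction(sample));
    }
}
```

If the check reports no Arabic-script characters for a document that visibly contains Persian, the text is probably stored as graphics or images, and OCR is the likely way forward. Note that a Latin-only document containing genuine question marks would also trigger the heuristic, so treat the result as a hint only.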
