I have a PDF, http://www.persianacademy.ir/UserFiles/File/fe1394.pdf, which contains Persian words, and I want to extract the words from it. I use the PDFBox library to get the text. Here is my code:
    package ir.blog.stack;

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFManager {

        public static void main(String[] args) {
            PDFManager pdfManager = new PDFManager();
            pdfManager.setFilePath("/home/saeed/Documents/words.pdf");
            try {
                System.out.println(pdfManager.ToText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        private PDFParser parser;
        private PDFTextStripper pdfStripper;
        private PDDocument pdDoc;
        private COSDocument cosDoc;
        private String Text;
        private String filePath;
        private File file;

        public PDFManager() {
        }

        public String ToText() throws IOException {
            this.pdfStripper = null;
            this.pdDoc = null;
            this.cosDoc = null;

            file = new File(filePath);
            parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            try {
                // Read text from pages 1 to 5.
                // To get the text of the whole PDF file, use this instead:
                // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                Text = pdfStripper.getText(pdDoc);
                return Text;
            } finally {
                // Release the file handle once the text has been extracted.
                pdDoc.close();
            }
        }

        public void setFilePath(String filePath) {
            this.filePath = filePath;
        }
    }
And this is part of the output:
° ǽA ° SwA ²j±ÇÇM/SwA ²joÇ Ak¼ÇQ ³Ç«AjA p°oÇ«A ³ÇM BÇU éÇ
BÇM ¤ Ø°A ·ª¦ °j ³ An <»wB®{Sv½p> ° <»wB®z¯BMp> ,<³¯BhQBa> ,<³¯BiRnB\U>
»¯BwC³ÇM ©½o¼¢Moǯnj kǯA²k{ ³TiBw <»wB®{> BM ¨°j ·ª¦ °j ° <³¯Bi> ·ª¦
k{BÇM ³TÇ{Aj j±]° o¯ ßB
UA ¬C nj ³ ºA²kîB RBª¦ ½A ߺÀ«A ³ ©¼MB½»«nj
/jnAk¯
° ²k{tBLTA »¼® Øßi pA j±i »Moî Øßi ° ²k{ ³To£ »Moî Øßi pA B« Øßi
Do I need to take extra steps to get the correct words?
The PDF in question simply does not contain the information required for text extraction. You will have to try with OCR.
In detail
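For text extraction from a PDF to succeed, the PDF must contain some information on which Unicode character each glyph in use represents.
The PDF specification (ISO 32000-1, section 9.10.2) describes the following text extraction process: if the font dictionary contains a ToUnicode CMap, use it to convert the character code to Unicode; otherwise, if the font is a simple font that uses a predefined encoding or only standard glyph names, map the character code to a character name and look that name up in the Adobe Glyph List; otherwise, if the font is a composite font that uses a predefined CMap or a predefined character collection, map the character code through the corresponding CID-to-Unicode CMap. If none of these methods produces a Unicode value, there is no way to determine what the character code represents.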
In the case of the sample PDF, the fonts in question do not provide the information this process relies on.
Thus, the process quoted above fails to produce a Unicode value.
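To make the failure mode concrete, here is a minimal sketch of that lookup chain in plain Java. It is an illustration under stated assumptions, not PDFBox code: `FontModel`, `UnicodeLookup`, and the three-entry `GLYPH_LIST` are hypothetical names invented for this example, the real Adobe Glyph List is far larger, and the composite-font step is left out for brevity.

```java
import java.util.Map;
import java.util.Optional;

final class UnicodeLookup {

    /** Hypothetical, simplified view of a simple font; not a PDFBox type. */
    static final class FontModel {
        final Map<Integer, String> toUnicode;  // ToUnicode CMap (code -> text), null if absent
        final Map<Integer, String> glyphNames; // encoding (code -> glyph name), null if absent

        FontModel(Map<Integer, String> toUnicode, Map<Integer, String> glyphNames) {
            this.toUnicode = toUnicode;
            this.glyphNames = glyphNames;
        }
    }

    // Tiny stand-in for the Adobe Glyph List (glyph name -> Unicode text).
    static final Map<String, String> GLYPH_LIST =
            Map.of("A", "A", "space", " ", "eacute", "\u00E9");

    /** Returns the Unicode text for a character code, or empty if nothing maps it. */
    static Optional<String> resolve(FontModel font, int code) {
        // Step 1: a ToUnicode CMap, when present, takes precedence.
        if (font.toUnicode != null && font.toUnicode.containsKey(code)) {
            return Optional.of(font.toUnicode.get(code));
        }
        // Step 2: fall back to the glyph name, but only if it is a standard name.
        if (font.glyphNames != null) {
            String name = font.glyphNames.get(code);
            if (name != null && GLYPH_LIST.containsKey(name)) {
                return Optional.of(GLYPH_LIST.get(name));
            }
        }
        // No mapping: the extractor cannot know which character was meant.
        return Optional.empty();
    }

    public static void main(String[] args) {
        FontModel mapped = new FontModel(Map.of(0x42, "\u0628"), null);
        FontModel standardNames = new FontModel(null, Map.of(0x41, "A"));
        FontModel customNames = new FontModel(null, Map.of(0x42, "G42"));

        System.out.println("ToUnicode present: " + resolve(mapped, 0x42).isPresent());
        System.out.println("Standard name:     " + resolve(standardNames, 0x41).isPresent());
        System.out.println("Custom name only:  " + resolve(customNames, 0x42).isPresent());
    }
}
```

A font that resembles the last case, with neither a ToUnicode map nor standard glyph names, leaves an extractor nothing to work with, so it can only pass the raw character codes through. Output of that kind looks like the garbage shown in the question.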
The PDF specification alternatively allows the use of ActualText entries in structure element dictionaries or marked-content sequences to override the text some content shall represent.
In the case of the sample PDF, no ActualText entries are used.
One can look deeper than the PDF specification describes. In particular, one can dive into the embedded font programs to find font-specific information on which Unicode character a font glyph represents.
In the case of the sample PDF, the embedded font programs do not provide usable information of this kind either.
Thus, in the case of the sample PDF, you will most likely have to resort to OCR.
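If you process many PDFs and want to decide automatically when to fall back to OCR, one option is to check whether the extracted text contains Persian letters at all. The following is a rough heuristic sketch in plain Java, not part of PDFBox: the class name `ExtractionCheck`, its method names, and the 0.5 threshold are assumptions made for this example, and the threshold would need tuning for documents that mix scripts.

```java
final class ExtractionCheck {

    /** Fraction of the letters in the text that belong to the Arabic script (used for Persian). */
    static double arabicScriptRatio(String text) {
        long letters = 0;
        long arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace, and symbols
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /** True if at least the given share of the letters is in the Arabic script. */
    static boolean looksLikePersian(String text, double threshold) {
        return arabicScriptRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String garbled = "SwA \u00B2j\u00B1\u00C7\u00C7M/SwA";   // fragment of the output in the question
        String persian = "\u0641\u0627\u0631\u0633\u06CC";       // the word "Farsi" in Persian letters

        System.out.println("garbled looks Persian: " + looksLikePersian(garbled, 0.5));
        System.out.println("persian looks Persian: " + looksLikePersian(persian, 0.5));
    }
}
```

You would pass the string returned by `PDFTextStripper.getText` to `looksLikePersian` and send the document to OCR when the check fails. The check only tells you that extraction produced the wrong script; it cannot recover the text.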