pdfBox Return Bad Encoding Charachter

1.8k Views Asked by At

i have a pdf http://www.persianacademy.ir/UserFiles/File/fe1394.pdfthat i want to extract words from it(contain persian words.).i use PDFBox library to get words.here is my code:

package ir.blog.stack;

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFManager {

    public static void main(String[] args) {
        PDFManager pdfManager = new PDFManager();
        pdfManager.setFilePath("/home/saeed/Documents/words.pdf");
        try {
            System.out.println(pdfManager.ToText());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc ;
    private COSDocument cosDoc ;

    private String Text ;
    private String filePath;
    private File file;

    public PDFManager() {

    }
    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;

        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(5);

        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

}

and this is part of output:

° ǽA ° SwA ²j±ÇÇM/SwA ²joÇ Ak¼ÇQ ³Ç«AjA p°oÇ«A ³ÇM BÇU éÇ
BÇM ¤ Ø°A ·ª¦ °j ³ An <»wB®{Sv½p> ° <»wB®z¯BMp> ,<³¯BhQBa> ,<³¯BiRnB\U>
»¯BwC³ÇM ©½o¼¢Moǯnj kǯA²k{ ³TiBw <»wB®{> BM ¨°j ·ª¦ °j ° <³¯Bi> ·ª¦
k{BÇM ³TÇ{Aj j±]° o¯ ßB
UA ¬C nj ³ ºA²kîB RBª¦ ½A ߺÀ«A ³ ©¼MB½»«nj
/jnAk¯
° ²k{tBLTA »¼® Øßi pA j±i »Moî Øßi ° ²k{ ³To£ »Moî Øßi pA B« Øßi

shall i do extra actions to get right words?

1

There are 1 best solutions below

1
On

The PDF in question simply does not contain the information required for text extraction. You will have to try with OCR.

In detail

For text extraction from a PDF to succeed, the PDF must contain some information on which Unicode character is represented by each used glyph.


The PDF specification describes the following text extraction process:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

In case of the sample PDF, the fonts in question

  • do not have ToUnicode maps;
  • are composite;
  • use Identity-H as Encoding;
  • have a CIDSystemInfo value of Adobe-Identity-0.

Thus, the process quoted above fails to produce a Unicode value.


The PDF specification alternatively allows the use of ActualText entries in structure element dictionaries or marked-content sequences to override the text some content shall represent.

In case of the sample PDF, no ActualText entries are used.


One can look deeper than the PDF specification describes, in particular one can dive into the embedded font programs to find font specific information on the Unicode characters some font glyph represents.

In case of the sample PDF, the embedded font programs

  • do not contain Unicode values for the glyphs;
  • use uninformative glyph names like "glyph89".

Thus, in case of the sample PDF, you most likely will have to resort to OCR.