I am currently working on PDF scanning on iOS using PDFKitten. I am trying to extract text for searching in a PDF that uses a Type0 font, but I am not able to extract the text: some entries in the ToUnicode map are missing and some are misinterpreted. Could there be an issue with the parsing of the CMap? If I don't have a complete CMap, how should I derive it? Can I supply external entries for the missing ToUnicode entries?
Thanks
The PDF specification offers hints on how to extract text content in section 9.10.2 Mapping Character Codes to Unicode Values:
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font’s CMap.
b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
Furthermore, as section 9.10.1 indicates,
According to the specification, if these methods fail to produce a Unicode value, there is no way to determine what the character code represents. This is not entirely true: embedded font programs, for example, may contain their own mappings to Unicode. Such additional sources of information, however, are beyond the actual PDF format.
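As a rough illustration of the simple-font steps (a) and (b) and of the name construction in step (c) for composite fonts, here is a minimal Python sketch. The tables `WIN_ANSI_EXCERPT` and `GLYPH_LIST_EXCERPT` are tiny hand-picked excerpts, and all function names are hypothetical helpers introduced for this example; none of this is PDFKitten code.

```python
# Tiny excerpts for illustration only; the real WinAnsiEncoding table and
# the Adobe Glyph List are far larger.
WIN_ANSI_EXCERPT = {0x41: "A", 0x80: "Euro", 0x95: "bullet"}
GLYPH_LIST_EXCERPT = {
    "A": "\u0041",
    "Euro": "\u20ac",
    "bullet": "\u2022",
    "fi": "\ufb01",
}


def apply_differences(base_encoding, differences):
    """Overlay a /Differences array on a base encoding.

    The array has the form [code name name ... code name ...]: each integer
    sets the current code, and every following name is assigned to
    consecutive codes starting there.
    """
    encoding = dict(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            encoding[code] = item
            code += 1
    return encoding


def simple_font_to_unicode(char_code, encoding, glyph_list):
    """Steps (a) and (b): code -> glyph name -> Unicode, or None if unknown."""
    name = encoding.get(char_code)
    if name is None:
        return None
    return glyph_list.get(name)


def ucs2_cmap_name(registry, ordering):
    """Step (c) for composite fonts: build the registry-ordering-UCS2 name."""
    return f"{registry}-{ordering}-UCS2"
```

With a hypothetical Differences array `[0x80, "fi", "Euro"]`, code 0x80 resolves to the fi ligature and 0x81 to the euro sign, while codes missing from the tables yield `None`, which is the "no way to determine" case mentioned above.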
EDIT
The OP provided the file in question, iPhoneConfigurationProfileRef-2013-GM.pdf, via mail and indicated
As he didn't get a mapping for any glyph, let us look at the title page as an example.
The content stream contains these operations relevant for text extraction:
So we need to look only at the font G1 on page 1. Fortunately the font has a ToUnicode map:
Trying to apply this map, one gets (based on the explicit beginbfrange...endbfrange entries):
This matches the appearance of the page very well:
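Since the question asks whether the CMap parsing itself might be at fault, here is a minimal Python sketch of how bfchar and bfrange sections of a ToUnicode CMap can be applied. It is an illustration under simplifying assumptions, not PDFKitten's parser: it expects the already decompressed CMap stream as text, assumes hex strings with an even number of digits and no embedded whitespace, and the function names are hypothetical.

```python
import re

_HEX = r"<([0-9A-Fa-f]+)>"
_BFCHAR_ENTRY = re.compile(_HEX + r"\s*" + _HEX)
# A bfrange entry is <lo> <hi> followed by either one start value
# or an array with one destination per source code.
_BFRANGE_ENTRY = re.compile(
    _HEX + r"\s*" + _HEX + r"\s*(?:" + _HEX + r"|\[(.*?)\])", re.S
)


def _utf16be(hex_digits):
    """Destination strings in ToUnicode CMaps are UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text):
    """Return a dict mapping character codes (int) to Unicode strings."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in _BFCHAR_ENTRY.findall(block):
            mapping[int(src, 16)] = _utf16be(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in _BFRANGE_ENTRY.findall(block):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if start:
                # Single start value: increment it once per source code.
                width = len(start) // 2
                for offset, code in enumerate(codes):
                    value = (int(start, 16) + offset).to_bytes(width, "big")
                    mapping[code] = value.decode("utf-16-be")
            else:
                # Array form: explicit destination for each source code.
                targets = re.findall(_HEX, array)
                for code, target in zip(codes, targets):
                    mapping[code] = _utf16be(target)
    return mapping


def decode_codes(mapping, codes, missing="\ufffd"):
    """Translate a sequence of character codes, marking unmapped ones."""
    return "".join(mapping.get(code, missing) for code in codes)
```

A destination may be longer than one UTF-16 code unit (a ligature mapped to several characters, for instance), and the array form of bfrange is easy to overlook; a parser that handles only the three-hex-string form will show both missing and wrong entries. Codes that remain unmapped are rendered here as U+FFFD so that gaps in the map stay visible during testing.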