I am trying to extract the text of page 5 of a PDF.
The PDF has a font, YLJAAA+CMSY10, which has no mappings (CMap) or even encodings (no default encoding and no /Differences).
While extracting text, after the string "tetex package", CGPDFScanner returns the character "\x15", which is encountered many times.
When this character is encountered, the current font is the one mentioned above, which provides nothing for mapping the PDF string to text.
What is this \x15 character?
Thanks.
I found 2 (not "many") occurrences of this:
which is a number in octal; this is the same number that is \x15 in hexadecimal.
The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol").
In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will (as long as it does the same with the embedded font). Assuming the font set is not reordered, checking a random list of CMxx encodings shows that the character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).
Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
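To make the arithmetic and the lookup concrete, here is a minimal Python sketch. It is not what CGPDFScanner does internally. The names `CMSY_EXCERPT`, `decode_octal_escapes` and `cmsy_to_unicode` are made up for illustration, the table is a small hand-copied excerpt of the standard CMSY layout, and the whole thing assumes the embedded subset was not reordered. It only handles octal escapes, not the other escape forms a PDF string may contain.

```python
import re

# Hand-copied excerpt of the standard CMSY (Computer Modern Symbol) layout.
# Assumption: the embedded font subset keeps the standard glyph positions.
CMSY_EXCERPT = {
    0x00: "\u2212",  # minus
    0x01: "\u00b7",  # periodcentered
    0x02: "\u00d7",  # multiply
    0x14: "\u2264",  # lessequal
    0x15: "\u2265",  # greaterequal
}


def decode_octal_escapes(literal: str) -> bytes:
    """Turn octal escapes such as \\025 into raw character codes.

    Hypothetical helper: other PDF escapes (\\n, \\(, ...) are not handled.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]{1,3})", literal):
        # Copy the unescaped text before this escape as-is.
        out += literal[pos:match.start()].encode("latin-1")
        # Interpret the digits as base 8 and keep the low byte.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += literal[pos:].encode("latin-1")
    return bytes(out)


def cmsy_to_unicode(codes: bytes) -> str:
    """Map character codes to Unicode; unknown codes become U+FFFD."""
    return "".join(CMSY_EXCERPT.get(code, "\ufffd") for code in codes)


codes = decode_octal_escapes(r"\025")
print([hex(code) for code in codes], cmsy_to_unicode(codes))
```

For a real extractor the table would have to cover the full font, and a reordered subset would still need the glyph names from the embedded font program to get a reliable answer.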