How to check if Encoding and ToUnicode are properly done for a PDF?

I am using qpdf to check whether Encoding and ToUnicode are set up properly for a PDF file. I run the following command and then search the output file for the word 'ToUnicode'. The purpose is to make sure that ligatures in the file can be decoded properly by PDF viewers such as Adobe Acrobat Reader, pdf.js, and pdfium.

qpdf --stream-data=uncompress input.pdf output.txt

Is this the right way? What is recommended?

There are 1 best solutions below

This is quite a difficult task.

Your document can include multiple fonts, some with a ToUnicode CMap and some without, and all of them can be valid.
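One rough way to see which fonts carry a ToUnicode entry is to scan the uncompressed file for font dictionaries. The sketch below is an illustration, not a real PDF parser. It assumes the file was first rewritten with something like `qpdf --qdf --object-streams=disable input.pdf output.pdf`, so that the font dictionaries appear as plain text. The function name `list_fonts` and the regular expressions are made up for this example. Descendant CIDFont dictionaries are skipped because, for composite fonts, the ToUnicode entry sits on the parent Type0 font.

```python
import re

# One indirect object: "12 0 obj ... endobj" (assumes uncompressed, QDF-style text).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
# "/Type /Font" but not "/Type /FontDescriptor".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
# Descendant fonts of a Type0 font; ToUnicode belongs to the parent, so skip these.
DESCENDANT_RE = re.compile(rb"/Subtype\s*/CIDFontType[02]\b")
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def list_fonts(pdf_bytes):
    """Return (object number, base font name, has ToUnicode) per font dictionary."""
    fonts = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        if not FONT_RE.search(body) or DESCENDANT_RE.search(body):
            continue
        name = BASEFONT_RE.search(body)
        fonts.append((
            int(match.group(1)),
            name.group(1).decode("latin-1") if name else "?",
            b"/ToUnicode" in body,
        ))
    return fonts
```

For a real file, something like `list_fonts(open("output.pdf", "rb").read())` would give one row per font. A font reported without ToUnicode is not necessarily broken, as noted above; it only tells you which fonts need a closer look.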

Then, for the fonts that do include a ToUnicode CMap, you have to check that every character ID used with that font is also present in the ToUnicode CMap.
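The coverage check can be sketched as follows. This is a minimal example under some assumptions: the ToUnicode CMap stream has already been extracted as text, it uses only the usual `bfchar` and `bfrange` sections, and the list of character IDs used on the pages (`used_ids`) comes from somewhere else, since extracting it from content streams is not shown here. The names `parse_tounicode` and `missing_ids` are hypothetical.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
BFCHAR_RE = re.compile(HEX + r"\s*" + HEX)
# <lo> <hi> followed by either a single <dst> or an array [<dst1> <dst2> ...].
BFRANGE_RE = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)


def _decode(hex_text):
    # ToUnicode destinations are UTF-16BE code units written as hex.
    return bytes.fromhex(hex_text).decode("utf-16-be", errors="replace")


def parse_tounicode(cmap_text):
    """Return {character ID: Unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_RE.findall(section):
            mapping[int(src, 16)] = _decode(dst)
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in BFRANGE_RE.findall(section):
            first, last = int(lo, 16), int(hi, 16)
            if dst:
                # Single destination: its value is incremented for each ID in the range.
                base, width = int(dst, 16), len(dst) // 2
                for i in range(last - first + 1):
                    mapping[first + i] = _decode((base + i).to_bytes(width, "big").hex())
            else:
                # Array form: one destination per ID, in order.
                for i, item in enumerate(re.findall(HEX, array)):
                    mapping[first + i] = _decode(item)
    return mapping


def missing_ids(used_ids, mapping):
    """Character IDs used with the font that have no ToUnicode entry."""
    return sorted(set(used_ids) - set(mapping))
```

An empty result from `missing_ids` only means that every used ID has some mapping; it says nothing about whether the mapping is the right one.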

The last step is to check that each character ID is mapped to the right character (or characters, in the case of a ligature). This cannot be done automatically, because you do not know which character a given ID is supposed to represent. For example, suppose the glyph 'A' is drawn when character ID 1 is shown on the page, but the ToUnicode CMap maps character ID 1 to 'B'. This is a logical error that cannot be verified automatically.
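Since this step needs a human, about the most a script can do is print the mapping in a readable form so that it can be compared against the rendered page. The sketch below assumes you already have the ToUnicode mapping as a Python dict of character ID to Unicode string; the function name `describe_mapping` is made up for this example.

```python
import unicodedata


def describe_mapping(mapping):
    """Return one human-readable line per character ID, for manual review."""
    lines = []
    for char_id in sorted(mapping):
        text = mapping[char_id]
        names = " + ".join(unicodedata.name(ch, "UNNAMED") for ch in text)
        # More than one target character usually means a ligature such as "fi".
        note = "  [multi-character]" if len(text) > 1 else ""
        lines.append(f"{char_id:04X} -> {text!r} ({names}){note}")
    return lines


if __name__ == "__main__":
    # Hypothetical mapping: ID 1 maps to "A", ID 2 is an "fi" ligature.
    for line in describe_mapping({1: "A", 2: "fi"}):
        print(line)
```

A reviewer would still have to look at the glyph that each ID actually draws and confirm that the listed characters match it.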