I am using qpdf to check whether Encoding and ToUnicode are properly set up for a PDF file. I run the following command and then search the output file for the word 'ToUnicode'. The purpose is to make sure that ligatures in the file can be decoded properly by a PDF viewer such as Adobe Acrobat Reader, pdf.js, pdfium, etc.
qpdf --stream-data=uncompress input.pdf output.txt
Is this the right way? What is recommended?
This is quite a difficult task.
Your document can include multiple fonts, some with a ToUnicode cmap and some without, and all of them can be valid. Finding the word 'ToUnicode' somewhere in the output therefore tells you very little on its own.
Then, for each font that includes a ToUnicode cmap, you have to check that every character ID used with that font is also present in that cmap.
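The coverage check described here can be sketched in Python. This is only a minimal illustration, not a full CMap parser: it assumes you have already pulled the text of one ToUnicode stream out of the uncompressed file and collected the character IDs that the page content uses with that font (that collection step is the hard part and is not shown). The function names `parse_tounicode` and `missing_codes` are made up for this example, and destination values are assumed to be whole UTF-16BE code units.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"  # one hex string token such as <0041>


def _decode(hex_str):
    # Destination values are UTF-16BE; a ligature has several code units.
    return bytes.fromhex(hex_str).decode("utf-16-be", errors="replace")


def parse_tounicode(cmap_text):
    """Return {character ID: Unicode string} for bfchar/bfrange entries."""
    mapping = {}
    # bfchar sections: pairs of <src> <dst>
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = _decode(dst)
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        array_form = HEX + r"\s*" + HEX + r"\s*\[(.*?)\]"
        # Form <lo> <hi> [<dst1> <dst2> ...]: one destination per code
        for lo, _hi, arr in re.findall(array_form, body, re.S):
            for offset, dst in enumerate(re.findall(HEX, arr)):
                mapping[int(lo, 16) + offset] = _decode(dst)
        # Form <lo> <hi> <dst>: the last code unit of dst is incremented
        rest = re.sub(array_form, "", body, flags=re.S)
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, rest):
            prefix, last = dst[:-4], int(dst[-4:], 16)
            for offset in range(int(hi, 16) - int(lo, 16) + 1):
                mapping[int(lo, 16) + offset] = _decode(
                    prefix + format(last + offset, "04X")
                )
    return mapping


def missing_codes(used_codes, mapping):
    """Character IDs used on the page that the cmap does not cover."""
    return sorted(set(used_codes) - set(mapping))


# Made-up sample: ID 3 is an "fi" ligature mapped to two characters.
sample_cmap = """
2 beginbfchar
<0001> <0041>
<0003> <00660069>
endbfchar
1 beginbfrange
<0010> <0012> <0061>
endbfrange
"""
sample_mapping = parse_tounicode(sample_cmap)
sample_missing = missing_codes([0x01, 0x03, 0x10, 0x20], sample_mapping)
```

Any ID reported by `missing_codes` is one that a viewer would have to guess at when copying or searching text. An empty result only means every used ID has some mapping, not that the mapping is right.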
The last step is to check that each character ID is mapped to the right character (or characters, for a ligature). This cannot be done automatically, because you don't know which character a given ID is supposed to represent. For example, suppose the glyph 'A' is drawn on the page using character ID 1, but the ToUnicode cmap maps character ID 1 to the character 'B'. This is a logical error that cannot be detected automatically.
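To make the limitation concrete, here is a small Python sketch. The helper names `decode_destination` and `passes_structural_check` are invented for the illustration. It shows that a ligature destination decodes to several characters, and that a correct mapping and a wrong one look identical to any purely structural check.

```python
def decode_destination(hex_str):
    # ToUnicode destinations are UTF-16BE; a ligature has several code units.
    return bytes.fromhex(hex_str).decode("utf-16-be")


def passes_structural_check(mapping, used_codes):
    # All a tool can verify: every used ID has a non-empty mapping.
    return all(mapping.get(code) for code in used_codes)


# An "ffl" ligature glyph maps to three characters.
ligature_text = decode_destination("00660066006C")

# Character ID 1 draws the glyph 'A' on the page (known only to a human
# looking at the rendered page, not to the tool).
correct_cmap = {1: decode_destination("0041")}  # ID 1 -> 'A'
wrong_cmap = {1: decode_destination("0042")}    # ID 1 -> 'B' (logical error)

correct_ok = passes_structural_check(correct_cmap, [1])
wrong_ok = passes_structural_check(wrong_cmap, [1])
```

Both `correct_ok` and `wrong_ok` come out true, which is the point: telling them apart requires knowing what the glyph actually looks like, for example by comparing rendered pages against extracted text by eye.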