Ghostscript makes text unsearchable after converting to pdf


Starting with a PDF file in which all the text is searchable, I convert it to a new PS file with this command:

gswin64c -q -dSAFER -dNOPAUSE -dBATCH  -sDEVICE=ps2write -dDOPDFMARKS -dLanguageLevel=2 -sOutputFile="new.ps" "old.pdf"

After that I convert the new.ps file back to a PDF with this command:

gswin64c -q -r400 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite  -dSubsetFonts=false -dAutoRotatePages=/PageByPage -dAutoRot -dCompatibilityLevel=1.2 -sOutputFile="new.pdf" new.ps

In the new.pdf file I can't search for text, although everything is visible. How can I solve this problem?

This is what I'm using: GPL Ghostscript 9.20 (2016-09-26)

Here is the output of the new.ps file:

'https://pastebin.com/HTXZJnKY'

There are 1 best solutions below


Firstly, don't go to PostScript and then to PDF. If you want a new PDF file, make it directly from the original PDF.

You haven't supplied the file to look at, so anything I say here is speculation. That said, PDF files can (and often do) contain a ToUnicode CMap. This maps character codes to Unicode code points, and it is the reliable way for a viewer to support copy, paste and search.

PostScript, being intended for printing (on paper), doesn't have any such mechanism. So by creating a PostScript file and then creating a new PDF file from that PostScript, you are going to lose the ToUnicode information if it was present.
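To make the role of the ToUnicode CMap concrete, here is a minimal Python sketch. It is a simplification, not how any particular viewer works: the table `to_unicode` and the function `extract_text` are hypothetical names invented for illustration, and real viewers also try font encodings and glyph names before giving up.

```python
# Hypothetical ToUnicode table: character code -> Unicode text.
to_unicode = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def extract_text(codes, cmap=None):
    """Return the text recoverable from a run of character codes.

    With a map, each code is looked up. Without one, this sketch
    assumes the codes are simply read as if they were ASCII.
    """
    if cmap is not None:
        return "".join(cmap.get(code, "\ufffd") for code in codes)
    return "".join(chr(code) for code in codes)


# The character codes actually stored in the page content.
codes = [0x01, 0x02, 0x03, 0x03, 0x04]

with_map = extract_text(codes, to_unicode)   # searchable text
without_map = extract_text(codes)            # control characters only
```

With the table present the codes decode to real text; once the table is lost in the PostScript step, the same codes no longer correspond to anything a search can match.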

Further than that, if the original file lacked a ToUnicode CMap, then it may be that the character codes used simply happened to match ASCII, which is enough for search to work. The default for both ps2write and pdfwrite is to subset fonts. This has the effect of altering the character codes so that the first glyph used gets character code 1, the second gets character code 2, and so on. So Hello becomes 0x01, 0x02, 0x03, 0x03, 0x04, which no longer matches ASCII, and the text can't be searched.
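The renumbering described above can be sketched in a few lines of Python. This only illustrates the idea of assigning codes in order of first use; `subset_codes` is a hypothetical helper, not Ghostscript's actual subsetting code.

```python
def subset_codes(text):
    """Assign codes 1, 2, 3, ... to glyphs in order of first use."""
    code_for = {}   # glyph -> new character code
    codes = []      # the re-encoded string
    for glyph in text:
        if glyph not in code_for:
            code_for[glyph] = len(code_for) + 1
        codes.append(code_for[glyph])
    return codes, code_for


# Codes before subsetting: these happen to be ASCII, so search works.
original = [ord(ch) for ch in "Hello"]

# Codes after subsetting: the repeated 'l' reuses code 3.
renumbered, code_for = subset_codes("Hello")
```

Here `original` is the ASCII sequence 0x48, 0x65, 0x6C, 0x6C, 0x6F, while `renumbered` is 1, 2, 3, 3, 4. Only the mapping in `code_for` links the two, and nothing equivalent to it is carried through the PostScript file.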

You are also using a 3-year-old version of Ghostscript. The current version is 9.50 and you should upgrade to that anyway, even though it won't affect this particular situation.

Your command lines have problems. You don't need to specify -dLanguageLevel=2 for ps2write; that's the default. You haven't specified -dSubsetFonts=false for ps2write, so there's no point in specifying it for pdfwrite; the damage is done in the first pass. -dAutoRot won't do anything. Unless you have a good reason, you shouldn't change the resolution. Setting -dDOPDFMARKS won't preserve all the 'metadata' from the PDF file into the PostScript file; a lot of things, such as outlines and annotations, won't be preserved.

You have specified a very low CompatibilityLevel for pdfwrite. Why is that? It's fairly pointless anyway, since you are starting from level 2 PostScript.

So in summary: don't do PDF->PS->PDF, just do PDF->PDF.
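As a rough sketch of the single-pass conversion, the Python below builds a pdfwrite command using only switches that already appear in the question. The helper names `build_pdf_to_pdf_command` and `convert` are hypothetical, the executable name `gswin64c` and the file names are taken from the question, and whether the result is searchable still depends on what the original PDF contains.

```python
import subprocess


def build_pdf_to_pdf_command(src, dst, gs="gswin64c"):
    """Build a Ghostscript argument list for a direct PDF -> PDF pass."""
    return [
        gs,
        "-q",
        "-dSAFER",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        src,
    ]


def convert(src, dst, gs="gswin64c"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(build_pdf_to_pdf_command(src, dst, gs), check=True)


# Building the command does not require Ghostscript to be installed.
cmd = build_pdf_to_pdf_command("old.pdf", "new.pdf")
```

Calling `convert("old.pdf", "new.pdf")` would run the command. Note that there is no intermediate PostScript file, no resolution override and no CompatibilityLevel setting.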

If that doesn't achieve what you want you'll have to supply an example and be more specific about what your goal is here.