Having ghostscript leave JBIG2 files alone


I'm using gs to remove some bad OCR from PDFs that are essentially images of book pages with invisible text layers. The page images in some of these are encoded as JBIG2. When I run them through gs, it changes the image format to CCITT. That usually isn't a problem, but the re-encoded images can be anywhere from 10 to 20 times larger than the JBIG2 versions.

I was looking for a way to either have gs leave them alone, like PassThroughJPEGImages does for JPEGs, or re-encode them as JBIG2 through the mono image filter setting (MonoImageFilter), but I was unsuccessful. I didn't find any analogous passthrough option, and I got an error when setting the encoder to JBIG2Encode. I assume from what I did find that the latter isn't a standard option, but requires Luratech libraries.

Can anyone confirm or - preferably - explain my mistake?


There are 2 best solutions below

0
On BEST ANSWER

There's no current way to have Ghostscript pass JBIG2 images unchanged.

The pdfwrite device doesn't accept JBIG2Encode as an encoding method, so you can't use that.

The result is that CCITTFaxEncode is the only value you can use for the mono image filter parameter (MonoImageFilter).

In general, JBIG2 is little if any better than CCITT fax encoding. The exception is text: when the same glyph shapes recur on a page, the encoder can store each shape once and reuse it, which can give significant savings (this symbol reuse is also the source of the JBIG2 character-substitution bug in Xerox scanners that hit the news in 2013). It sounds like your images are encoded that way, so yes, you are going to get larger images out.
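
For reference, here is a minimal sketch of how the CCITT encoding can be requested explicitly when rewriting a file with pdfwrite. The file names are placeholders, and the switches are the usual distiller-style pdfwrite parameters as I understand them, so check them against the documentation for your Ghostscript version. The script only prints the command unless gs and the input file are both present.

```shell
#!/bin/sh
# Sketch: rewrite a PDF with pdfwrite and ask explicitly for CCITT fax
# encoding of 1-bit images. input.pdf and output.pdf are placeholder names.
in=input.pdf
out=output.pdf

# Build the argument list once so it can be inspected before running.
set -- -q -o "$out" -sDEVICE=pdfwrite \
    -dPassThroughJPEGImages=true \
    -dEncodeMonoImages=true \
    -dMonoImageFilter=/CCITTFaxEncode \
    "$in"

# Show the command that would be run.
echo "gs $*"

# Only run Ghostscript if it is installed and the input file exists.
if command -v gs >/dev/null 2>&1 && [ -f "$in" ]; then
    gs "$@"
fi
```

Comparing the size of the output with the original should show quickly whether the JBIG2 images in a given file were benefiting from symbol reuse.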

0
On

This is about workarounds that remove bad OCR while leaving the JBIG2 images in place. I'm using Linux, but I think the tools are mostly available on Windows as well.

1) A command line solution

Heavily inspired by this reply, but avoiding the Ghostscript step at the end:

  1. Back up your pdf.

  2. Decompress your pdf with qpdf (or pdftk)

    qpdf --qdf --object-streams=disable input.pdf editable.pdf
    

    This creates a pdf file in qdf mode, readable in text editors (that can handle large files).

  3. Remove all lines ending with Tj or TJ in a text editor or via sed:

    sed '/T[Jj]$/d' ./editable.pdf > editable-no-text.pdf
    

    Those are the pdf commands that render text strings.

    This will leave behind the related text operators: Tm and Td, which position text on the page, and Tr, which sets the text rendering mode. These do not contain any text themselves and don't take up as much space. You may remove them as well via:

    sed '/T[Jjdmr]$/d' ./editable.pdf > editable-no-text.pdf
    

    I have not had any negative side effects, but check the result before proceeding.

  4. Check that editable-no-text.pdf looks like it's supposed to.

  5. Recompress your pdf:

    qpdf --compress-streams=y --object-streams=generate editable-no-text.pdf final.pdf
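
The steps above can be put together roughly as follows. This is only a sketch: the file names are the ones used above, the sample content stream is made up to show what the filter deletes, and the fix-qdf call (a helper that ships with qpdf and repairs stream lengths after hand edits) is my own addition rather than part of the original steps. LC_ALL=C is there because the file still contains binary image data.

```shell
#!/bin/sh
# strip_text: delete lines ending in Tj or TJ from stdin.
# With the argument "wide", also delete lines ending in Td, Tm, or Tr.
strip_text() {
    if [ "${1:-}" = wide ]; then
        LC_ALL=C sed '/T[Jjdmr]$/d'
    else
        LC_ALL=C sed '/T[Jj]$/d'
    fi
}

# Made-up content stream, used only to preview what the filter removes.
sample='BT
/F1 12 Tf
3 Tr
1 0 0 1 72 700 Tm
(bad ocr text) Tj
[(more) -20 (text)] TJ
ET
q 612 0 0 792 0 0 cm /Im0 Do Q'

printf '%s\n' "$sample" | strip_text

# Real run, only if qpdf is installed and input.pdf exists.
if command -v qpdf >/dev/null 2>&1 && [ -f input.pdf ]; then
    qpdf --qdf --object-streams=disable input.pdf editable.pdf
    strip_text < editable.pdf > editable-no-text.pdf

    # Optional: repair stream lengths and offsets changed by the deletions.
    src=editable-no-text.pdf
    if command -v fix-qdf >/dev/null 2>&1; then
        fix-qdf editable-no-text.pdf > editable-fixed.pdf && src=editable-fixed.pdf
    fi

    qpdf --compress-streams=y --object-streams=generate "$src" final.pdf
fi
```

In the preview, the image-drawing line (the one ending in `Do Q`) survives both filters, which is the point: only the text operators go, and the JBIG2 streams are never re-encoded.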
    

2) A GUI solution

I used this before discovering the above. It is simpler, but more work with longer pdf files. I also assume it is safer, but you should have backups anyway.

Use Master PDF Editor (use version 4 from the end of that page, as the current version 5 has a lot of locked functions).

You can set it to select only text objects and then just select everything with Ctrl+A and remove with Del. Unfortunately, you have to do this for every page, so I would just cycle through Ctrl+A, Del, Page down.

While this is not properly scriptable, you could probably bodge it using xdotool.
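
A rough sketch of what that xdotool bodge could look like. It assumes the editor window has focus and is already set to select only text objects; the page count and delay are placeholders you would have to adjust. By default it only prints the commands, and it sends real keystrokes only when RUN=1 is set.

```shell
#!/bin/sh
# send_keys: print the xdotool command, or run it when RUN=1.
send_keys() {
    if [ "${RUN:-0}" = 1 ]; then
        xdotool key "$@"
    else
        echo "xdotool key $*"
    fi
}

# clear_pages N: repeat "select all, delete, page down" N times.
clear_pages() {
    i=1
    while [ "$i" -le "$1" ]; do
        send_keys ctrl+a
        send_keys Delete
        send_keys Next          # X keysym for Page Down
        if [ "${RUN:-0}" = 1 ]; then
            sleep "${DELAY:-1}" # give the editor time to redraw
        fi
        i=$((i + 1))
    done
}

# Dry run for 3 pages unless a page count is given as the first argument.
clear_pages "${1:-3}"
```

Run it as a dry run first, then with something like `RUN=1 sh script.sh 250` once the editor is in the foreground.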