Having ghostscript leave JBIG2 files alone

Question

298 Views Asked by Eponymous At 20 August 2025 at 02:47

I'm using gs to remove some bad OCR from PDFs that are essentially images of book pages with invisible text layers. The page images in some of these are encoded as JBIG2. When I run them through gs, it re-encodes the images as CCITT, which usually isn't bad, but the results can be anywhere from 10 to 20 times bigger than the JBIG2 versions.

I was looking for a way to either have gs leave them alone (like PassThroughJPEGImages) or re-encode them as JBIG2 via MonoImageFilter, but I was unsuccessful. I didn't find any analogous passthrough option and got an error on setting the filter to JBIG2Encode. I assume from what I did find that the latter isn't a standard option, but requires Luratech libraries.

Can anyone confirm or - preferably - explain my mistake?

hife On 21 November 2023 at 13:14

This is about workarounds to remove bad OCR while leaving jbig2 in place. I'm using Linux, but I think the tools are mostly available on Windows as well.

1) A command line solution

heavily inspired by this reply, but avoiding the ghostscript step at the end:

Back up your pdf.
Decompress your pdf with qpdf (or pdftk):
```
qpdf --qdf --object-streams=disable input.pdf editable.pdf
```
This creates a pdf file in qdf mode, readable in text editors (that can handle large files).
Remove all lines ending with Tj or TJ in a text editor or via sed:
```
sed '/T[Jj]$/d' ./editable.pdf > editable-no-text.pdf
```
Those are the pdf commands that render text strings.

This will leave behind further placement commands like Tm and Td that are related to positioning on the page and Tr that determines the display style of the text. These do not contain any text themselves and don't take up as much space. You may remove them as well via:
```
sed '/T[Jjdmr]$/d' ./editable.pdf > editable-no-text.pdf
```
I have not had any negative side effects, but open editable-no-text.pdf and check that it looks the way it's supposed to before proceeding.
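
To see what the sed filter does before running it on a real file, here is a minimal sketch on a made-up excerpt of a content stream. The sample lines and the variable names (`sample`, `stripped`, `remaining`) are assumptions for illustration, not taken from an actual PDF. It deletes the lines ending in Tj or TJ and then counts how many such lines are left:

```shell
# Hypothetical excerpt of a QDF content stream (illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
BT
/F1 12 Tf
1 0 0 1 72 700 Tm
3 Tr
(bad ocr) Tj
[(more) -20 (text)] TJ
ET
q 612 0 0 792 0 0 cm /Im0 Do Q
EOF

# Delete only the lines that end in a text-showing operator (Tj or TJ).
stripped=$(sed '/T[Jj]$/d' "$sample")
printf '%s\n' "$stripped"

# Count text-showing operators left behind; 0 means none survived.
# "|| true" keeps the script going, since grep exits non-zero on no match.
remaining=$(printf '%s\n' "$stripped" | grep -c 'T[Jj]$' || true)
echo "remaining text operators: $remaining"
rm -f "$sample"
```

The same `grep -a -c 'T[Jj]$' editable-no-text.pdf` count can serve as a quick sanity check on the real file, alongside opening it in a viewer.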

Recompress your pdf:
```
qpdf --compress-streams=y --object-streams=generate editable-no-text.pdf final.pdf
```
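
Since the point of this route is that the JBIG2 images stay untouched, a rough check is to compare how many lines mention `/JBIG2Decode` before and after. The sketch below assumes the filter name appears in plain text in the image dictionaries, which should normally be the case but is not guaranteed for every file. The helper `count_jbig2` and the stand-in files are hypothetical; with real data you would pass your original pdf and final.pdf instead.

```shell
# Count lines mentioning the JBIG2 filter; -a makes grep treat the PDF as text.
# "|| true" covers the no-match case, where grep prints 0 but exits non-zero.
count_jbig2() {
    grep -a -c '/JBIG2Decode' "$1" || true
}

# Stand-in files for illustration only (not real PDFs).
before=$(mktemp)
after=$(mktemp)
printf '%s\n' '<< /Filter /JBIG2Decode >>' '<< /Filter /JBIG2Decode >>' > "$before"
cp "$before" "$after"

if [ "$(count_jbig2 "$before")" = "$(count_jbig2 "$after")" ]; then
    result="JBIG2 stream count unchanged"
else
    result="JBIG2 stream count differs"
fi
echo "$result"
rm -f "$before" "$after"
```

Comparing file sizes with `ls -l` is a useful second check: the final file should be close to the size of the original, not 10 to 20 times larger.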

2) A GUI solution

I used this before discovering the above. It is simpler, but more work with longer pdf files. I also assume it is safer, but you should have backups anyway.

Use Master PDF Editor (use version 4 from the end of that page, as the current version 5 has a lot of locked functions).

You can set it to select only text objects and then just select everything with Ctrl+A and remove with Del. Unfortunately, you have to do this for every page, so I would just cycle through Ctrl+A, Del, Page down.

While this is not properly scriptable, you could probably bodge it using xdotool.

**KenS** · Accepted Answer

There's no current way to have Ghostscript pass JBIG2 images unchanged.

The pdfwrite device doesn't permit JBIG2Encode as a possible encoding method so you can't use that.

The result of this is that you can only use CCITTFaxEncode as the MonoImageFilter parameter.

In general, JBIG2 is little if any better than CCITTFax. The exception is text, where, if the content of the text is known, significant savings can be achieved by reusing segments (this is also the source of the JBIG2 character-substitution bug that hit the news in 2013). It sounds like your images are encoded that way, so yes, you are going to get larger images out.