How can I read the properties of the symbol dictionary used by the JBIG2 algorithm in my PDF?


I have a PDF that contains a long list of numbers and was compressed using the JBIG2 algorithm. When I look at the internal structure of the file, I can see that each page is built from two different XObjects:

(Pictured is Adobe Acrobat Preflight -> Internal structure.)

I can easily inspect the first one, called "XIPLAYER0" (not pictured); it even gives me the information bit by bit if I want. The second one is the one I am interested in, though. In it I can see that the image is built using 2 "Symbol Dictionaries" (the first one is marked grey). Is it possible to see the individual entries in such a dictionary? Or at least get some metadata for one of them?

Sample PDF (outside link)


There are 2 best solutions below

Answer 1

This is not really about PDF. The PDF is just the container for the JBIG2 data and its symbol dictionary, which is what you're really interested in.

But, as a first step, you'll need to get the JBIG2 images out of the PDF:

Extract images from PDF, how to handle JBIG2 encoded

That Stack Overflow question mentions poppler, and poppler does have a Python binding/wrapper:

https://pypi.org/project/python-poppler/

Once you get those JBIG2 files, maybe this can help:

jbig2_symbol_dict.c

The larger project that file belongs to has a command-line utility with a "dump" option, but the source says it is not implemented:

case dump:
    fprintf(stderr, "Sorry, segment dump not yet implemented\n");
    break;

So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
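
Some header-level metadata may still be readable without a full decoder. Below is a minimal sketch in Python, assuming the segment header and symbol dictionary header layouts as I understand them from the JBIG2 specification (ITU-T T.88), and assuming the input is an embedded stream without a file header, such as the JBIG2Globals data of the PDF. The function names are made up for illustration. It reports only the coding flags and the symbol counts, not the symbol bitmaps, which would need a real arithmetic or Huffman decoder.

```python
import struct

SYMBOL_DICTIONARY = 0  # segment type number, per my reading of T.88


def iter_segments(data):
    """Yield (number, seg_type, page, payload) for each segment of an
    embedded JBIG2 stream (sequential layout, no file header)."""
    pos = 0
    while pos + 11 <= len(data):  # 11 bytes is the shortest possible header
        number, flags = struct.unpack_from(">IB", data, pos)
        pos += 5
        seg_type = flags & 0x3F
        # Referred-to segment count: the short form keeps it in the top 3 bits.
        count = data[pos] >> 5
        if count == 7:  # long form: 29-bit count followed by retain-bit bytes
            count = struct.unpack_from(">I", data, pos)[0] & 0x1FFFFFFF
            pos += 4 + (count + 8) // 8
        else:
            pos += 1
        # The width of each referred-to number depends on this segment's number.
        ref_size = 1 if number <= 256 else 2 if number <= 65536 else 4
        pos += count * ref_size
        if flags & 0x40:  # 4-byte page association
            page = struct.unpack_from(">I", data, pos)[0]
            pos += 4
        else:
            page = data[pos]
            pos += 1
        length = struct.unpack_from(">I", data, pos)[0]
        pos += 4
        if length == 0xFFFFFFFF:  # unknown length: not handled in this sketch
            return
        yield number, seg_type, page, data[pos:pos + length]
        pos += length


def symbol_dict_info(payload):
    """Read the fixed header fields of a symbol dictionary segment."""
    flags = struct.unpack_from(">H", payload, 0)[0]
    huffman = bool(flags & 1)
    refagg = bool(flags & 2)
    template = (flags >> 10) & 3
    rtemplate = (flags >> 12) & 1
    pos = 2
    if not huffman:  # adaptive-template pixel coordinates
        pos += 8 if template == 0 else 2
    if refagg and rtemplate == 0:  # refinement adaptive-template coordinates
        pos += 4
    exported, new = struct.unpack_from(">II", payload, pos)
    return {
        "huffman": huffman,
        "refinement_aggregation": refagg,
        "template": template,
        "exported_symbols": exported,
        "new_symbols": new,
    }


# Example use, with a globals file as written by `pdfimages -all`:
# with open("out-001.jb2g", "rb") as fh:
#     for number, seg_type, page, payload in iter_segments(fh.read()):
#         if seg_type == SYMBOL_DICTIONARY:
#             print(number, symbol_dict_info(payload))
```

If the counts look implausible for a given file, the layout assumptions above are the first thing to check against the specification.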

Answer 2

The file in question has a known problem. A JBIG2 scan is supposed to be a highly compressed, clean pixel scan without some of the issues that a low-quality JPEG may introduce. However, the format as used by some commercial scanners can notoriously infill a 6 so that it looks like an 8, as seen in this sequence from page 1. See https://en.wikipedia.org/wiki/JBIG2#Disadvantages


For several reasons, some organisations recommend that it not be used for critical documents where image fidelity needs to match that of more conventional monochrome TIFF, GIF or PNG scans.

Extracting such an image requires 2 command lines using 2 tools (poppler and jbig2dec):

poppler\bin>pdfimages -all 7535-7pt.pdf out

and a for loop, in this case over 001 to 81 across the 243 outputs, similar to

jbig2\Library\bin>jbig2dec -o out-001 -t pbm out-001.jb2g out-001.jb2e
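
The loop could be sketched in Python as follows. This is only an illustration: it assumes jbig2dec is on the PATH, that every .jb2e file written by pdfimages has a matching .jb2g globals file next to it (as in the command above), and it adds a .pbm extension to the output name. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def jbig2dec_commands(embedded_files, exe="jbig2dec"):
    """Build one jbig2dec command line per embedded stream (.jb2e),
    pairing it with the globals file (.jb2g) that shares its stem."""
    commands = []
    for emb in sorted(Path(p) for p in embedded_files):
        glb = emb.with_suffix(".jb2g")  # assumed to exist next to the .jb2e
        out = emb.with_suffix(".pbm")
        commands.append([exe, "-o", str(out), "-t", "pbm", str(glb), str(emb)])
    return commands


def decode_all(directory="."):
    """Run jbig2dec on every .jb2e file found in the given directory."""
    for cmd in jbig2dec_commands(Path(directory).glob("*.jb2e")):
        subprocess.run(cmd, check=True)
```

Globbing for the .jb2e files avoids guessing which image numbers belong to the JBIG2 streams, since pdfimages numbers all images in sequence.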

Metadata for the first 3 pages can be seen here (a poor 200 dpi equivalent had been used):

23.01.0\Library\bin>pdfimages -list 7535-7pt.pdf  

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1184   832  gray    1   8  jpeg   no         6  0   100   100  554B 0.1%
   1     1 stencil  1967  1230  -       1   1  jbig2  no         8  0   200   200 7885B 2.6%
   2     2 image    1184   832  gray    1   8  jpeg   no        13  0   100   100  573B 0.1%
   2     3 stencil  1966  1200  -       1   1  jbig2  no        15  0   200   200 7415B 2.5%
   3     4 image    1184   832  gray    1   8  jpeg   no        19  0   100   100  552B 0.1%
   3     5 stencil  1967  1201  -       1   1  jbig2  no        21  0   200   200 7829B 2.7%
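
To pick the JBIG2 entries out of such a listing programmatically, a small parser along these lines might do. It is a sketch that assumes the 16-column layout shown above; other poppler versions may print different columns. The function name is made up for illustration.

```python
def jbig2_images(listing):
    """Return one dict per row of `pdfimages -list` output whose
    encoding column is "jbig2"."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the dashed rule and anything that is not a data row.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        if fields[8] == "jbig2":
            rows.append({
                "page": int(fields[0]),
                "num": int(fields[1]),
                "width": int(fields[3]),
                "height": int(fields[4]),
                "object": int(fields[10]),
                "x_ppi": int(fields[12]),
                "y_ppi": int(fields[13]),
                "size": fields[14],
            })
    return rows
```

The "num" value corresponds to the number in the file names that `pdfimages -all` writes, so it can be used to find the matching out-NNN files.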

The 81 PBMs will be faithful copies of the poor, variable inputs, typically:

/MediaBox [0 0 842 596] /Rotate 270 
/Image
/BitsPerComponent 1
/Width 1967
/Height 1230
/ImageMask true
/Filter
/JBIG2Decode

The old 243 images can then be discarded (the PDF file should have been discarded anyway, and the paper source rescanned at a higher resolution), as the images are of no use except to show the errors described above.