How to parse an entropy-coded JPEG block efficiently?

461 Views Asked by At

I'm just trying to jump through a SOS_MT block in a .JPEG file, I don't want to use the data for anything, I just want to know where it ends. According to what I understand from JPEG's article in Wikipedia, while all other blocks in the JPEG file start with a few bytes that indicate the blocks's length, a SOS_MT block is ... well, an evil swamp that you have no option but to parse byte-by-byte until you get to the end of it.

So I came with the following code to do just that:

entropyCoded :: Parser Int
entropyCoded = do
    list_of_lengths <-  many' $
         (
           do
             _ <- notWord8 0xFF
             return 1
         )
         <|>
         (
           do
             _ <- word8 0xFF
             _ <- word8 0
             return 2
         )
         <|>
         (
           do
             l <- many1 (word8 0xFF)
             _ <- satisfy (\x -> ( x >= 0xD0 && x < 0xD7 ))
             return $ 1 + length l
         )
         <|>
         (
           do
             _ <- word8 0xFF
             maybe_ff <- peekWord8'
             if maybe_ff == 0xFF
               then
                 return 1
               else
                 fail "notthere"
         )
    foldM (\ nn n -> nn `seq` return (nn + n) ) 0 list_of_lengths

This code uses Atoparsec and as far as I have had the chance to verify it, it is correct. It is just slow. Any tips on how to improve, performance-wise, this parser?

2

There are 2 best solutions below

0
On BEST ANSWER

If you want to skip over an SOS market, just look for the next marker that is not a restart marker.

Read bytes until you find and FF. If the next value 00, it is a compressed FF value and skip over it. If it's a restart marker skip over it. Otherwise, FF should start the next block.

0
On

a small addition to the previous answer, according to the ISO/IEC 10918-1 : 1993(E) standard:

B.1.1.5 Entropy-coded data segments

An entropy-coded data segment contains the output of an entropy-coding procedure. It consists of an integer number of bytes, whether the entropy-coding procedure used is Huffman or arithmetic.

NOTE 1

Making entropy-coded segments an integer number of bytes is performed as follows: for Huffman coding, 1-bits are used, if necessary, to pad the end of the compressed data to complete the final byte of a segment. For arithmetic coding, byte alignment is performed in the procedure which terminates the entropy-coded segment (see D.1.8).

NOTE 2

In order to ensure that a marker does not occur within an entropy-coded segment, any X'FF' byte generated by either a Huffman or arithmetic encoder, or an X'FF' byte that was generated by the padding of 1-bits described in NOTE 1 above, is followed by a "stuffed" zero byte (see D.1.6 and F.1.2.3).

so, when you enconter 0xFF in the entropy-encoded part at position N, read forward one more byte. if the next byte is 0x00 then it's a "stuffed" zero. if it's another 0xFF then there is a padding, re-check from N+1. every other byte (0x01-0xFE) is part of the next marker.