Repairing corrupted JPEG images from character replacement


Recently I ended up with some corrupted JPEG images after mistakenly running this command:

~$> sed -i 's/;/_/g' *

After that, every 0x3b byte in the JPEG images in the working directory and its subdirectories became 0x5f. Image viewers now display the images with visible corruption, as in this sample: [corrupted image sample]

I could not identify which bytes need to be restored. When I checked the images for warning or error flags with tools such as ExifTool, they just reported OK, because the corrupted JPEGs are not broken badly enough to stop a viewer from opening them.

The images need to be repaired, since I have no backup copies of them, but I don't know where to start. Simply replacing every 0x5f with 0x3b again does not work, because some of the 0x5f bytes were in the files originally, and trying every combination is infeasible (2^n cases, I guess, where n is the number of candidate 0x5f bytes). I've just started parsing the Huffman tables in the JPEG header, hoping to identify the points where the Huffman-coded data conflicts with the bytes in the file, but I'm not sure this will work.
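To put a number on that 2^n estimate, here is a minimal sketch that counts the candidate bytes and the resulting number of combinations. The function names `candidate_offsets` and `search_space` and the sample bytes are made up for illustration; they are not real image data.

```python
def candidate_offsets(data: bytes, byte: int = 0x5F) -> list:
    """Return the offset of every occurrence of `byte` (default '_') in `data`."""
    return [i for i, b in enumerate(data) if b == byte]


def search_space(n: int) -> int:
    """Each of the n candidates is either left as '_' or restored to ';'."""
    return 2 ** n


if __name__ == "__main__":
    # Hypothetical stand-in for the contents of a corrupted file.
    sample = b"\xff\xd8_abc_\x00_\xff\xd9"
    offsets = candidate_offsets(sample)
    print(f"{len(offsets)} candidates at {offsets}, "
          f"{search_space(len(offsets))} combinations")
```

For a real file, the bytes would come from something like `open(path, 'rb').read()`. Even a few dozen candidates push the count far beyond what can be tried exhaustively.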

How can I recover the images in this situation? I appreciate your help.

Answer by Mark Setchell

There appear to be 57 occurrences of 0x5f in your corrupted image. If you can't find a better way, you could maybe "eyeball" the effects of replacing the incorrect bytes in the image fairly quickly like this:

  • open the image in binary mode and read it all with JPEG = open('PdQpR.jpg','rb').read()

  • use offsets = [m.start() for m in re.finditer(b'_', JPEG)] to find the byte offsets of the 57 occurrences

  • display the image with cv2.imdecode() and cv2.imshow() and then enter a loop accepting keypresses with cv2.waitKey()

    p = move to previous one of 57 occurrences

    n = move to next one of 57 occurrences

    SPACE = toggle between 0x5f and 0x3b

    s = save current state

    q = quit

I had a quick attempt at this but haven't had much success using it yet:

#!/usr/bin/env python3

import re

import cv2
import numpy as np

# Load the image into a mutable buffer so single bytes can be toggled
filename = 'PdQpR.jpg'
with open(filename, 'rb') as f:
    JPEG = bytearray(f.read())

# Find the byte offsets of all the underscores
offsets = [m.start() for m in re.finditer(b'_', JPEG)]
N = len(offsets)
index = 0

while True:
    # Show user which entry we are at
    print(f'{index}/{N}: n=next, p=previous, space=toggle, q=quit')
    # Decode and display the JPEG
    im = cv2.imdecode(np.frombuffer(JPEG, dtype=np.uint8), cv2.IMREAD_COLOR)
    cv2.imshow(filename, im)

    # Mask to the low byte so the comparison with ord() works on all platforms
    key = cv2.waitKey(0) & 0xFF
    # n = next offset (wraps around at the end)
    if key == ord('n'):
        index = (index + 1) % N
    # p = previous offset (wraps around at the start)
    elif key == ord('p'):
        index = (index - 1) % N
    # q = quit
    elif key == ord('q'):
        break
    # space = toggle between underscore and semicolon
    elif key == ord(' '):
        if JPEG[offsets[index]] == ord('_'):
            print(f'{index}/{N}: Toggling to ;')
            JPEG[offsets[index]] = ord(';')
        else:
            print(f'{index}/{N}: Toggling to _')
            JPEG[offsets[index]] = ord('_')

cv2.destroyAllWindows()

Note: Toggling some bytes between '_' and ';' results in illegal images and error messages from cv2.imdecode() and/or cv2.imshow(). Ideally you would wrap these calls in a try/except and back out the last change if an error occurs. I haven't done that yet.
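The back-out idea from the note above could look roughly like this. It is only a sketch: `toggle_with_rollback` is a made-up name, and the decoder is passed in as a callable so the logic does not depend on OpenCV. It assumes the decoder signals failure either by raising or by returning None.

```python
def toggle_with_rollback(data: bytearray, offset: int, decode) -> bool:
    """Flip data[offset] between '_' and ';' and keep the flip only if it decodes.

    `decode` is any callable taking the raw bytes; a raised exception or a
    None result is treated as "illegal image" and the byte is restored.
    Returns True if the flip was kept, False if it was backed out.
    """
    old = data[offset]
    if old not in (ord('_'), ord(';')):
        raise ValueError(f'byte at offset {offset} is neither _ nor ;')
    data[offset] = ord(';') if old == ord('_') else ord('_')
    try:
        ok = decode(bytes(data)) is not None
    except Exception:
        ok = False
    if not ok:
        data[offset] = old  # back out the last change
    return ok
```

With OpenCV, `decode` could be something like `lambda b: cv2.imdecode(np.frombuffer(b, dtype=np.uint8), cv2.IMREAD_COLOR)`. A successful decode only means the file is still a legal JPEG, not that the byte is correct, so the visual check is still needed.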

Note: I didn't implement the save function; it would just be something like open('corrected.jpg', 'wb').write(JPEG)
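A slightly more careful version of that save step might look like this. The helper name `save_corrected` and the `_corrected` suffix are assumptions for illustration; the point is to write next to the original file rather than over it, since there is no backup.

```python
from pathlib import Path


def save_corrected(data: bytes, original: str, suffix: str = '_corrected') -> Path:
    """Write `data` beside `original` under a new name and return the new path.

    For example 'PdQpR.jpg' becomes 'PdQpR_corrected.jpg'; the original
    file is left untouched.
    """
    src = Path(original)
    dst = src.with_name(src.stem + suffix + src.suffix)
    dst.write_bytes(data)
    return dst
```

In the viewer loop this could be hooked to the s key, for example `save_corrected(JPEG, filename)` when `key == ord('s')`.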