'UTF-8' decoding error while using unireedsolomon package

I have been writing code that uses the unireedsolomon package. The package adds parity bytes, most of which are extended ASCII characters. I apply bit-level errors after converting the string, including these 'special character' parities, to a bit string with the following code:

def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8  # round up to whole bytes
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char

After converting the string to a bit stream, I apply errors at random. In some cases I cannot convert the result back to the special characters (or any characters): I hit the 'utf-8' decoding error before I can even decode the message.

If I flip a bit or two, the result should still fall within the 0-255 range of extended ASCII values, yet I still get errors. Is there any way to fix this?

Mark Tolonen (best answer)

It's a bit odd that the error-correction package works with Unicode strings. It would be better to encode byte data, since the data being encoded/decoded may not be only text. There is also no need to work with actual binary strings (Unicode 1s and 0s): flip bits in the byte strings directly.

Below I've wrapped the encode/decode routines so they take either Unicode text and return byte strings or vice versa. There is also a corrupt function that will flip bits in the encoded result to see the error correction in action:

import unireedsolomon as rs
import random

def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).
    '''
    b = bytearray(encoded) # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b)) # pick random byte
        bit = random.randrange(8)        # pick random bit
        b[index] ^= 1 << bit             # flip it
    return bytes(b) # back to read-only bytes, but not necessary

def encode(coder,msg):
    '''Convert the msg to UTF-8-encoded bytes and encode with "coder".  Return as bytes.
    '''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder,encoded):
    '''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
    '''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')

coder = rs.RSCoder(20,13)
msg = 'hello(你好)'  # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)

Output is shown below. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. With RSCoder(20,13) there are 7 parity bytes, which is enough to correct up to 3 byte errors, so you can actually change any 3 bytes at random (not just bits), including error-correcting bytes, and the message will still decode.

b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)

A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
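As a quick check of the Latin-1 point above, the following sketch uses only the Python standard library (no unireedsolomon calls) and shows that every byte value survives a Latin-1 round trip unchanged, while the same text takes more bytes in UTF-8. The variable names are just for illustration.

```python
# All 256 possible byte values, as one byte string.
raw = bytes(range(256))

# Latin-1 maps byte value n to code point U+00nn, so decoding never fails.
text = raw.decode('latin1')
code_points = [ord(c) for c in text]

# Encoding with Latin-1 gives back the original bytes exactly.
round_trip = text.encode('latin1')

# UTF-8 needs two bytes for each code point from U+0080 to U+00FF,
# so the same string is longer when encoded as UTF-8.
utf8_length = len(text.encode('utf8'))

print(code_points == list(range(256)), round_trip == raw, utf8_length)
```

This is why a corrupted byte string can always be passed through 'latin1', whereas calling .encode() or .decode() with the default UTF-8 codec changes the byte layout and can fail.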

rcgldr

UTF-8 uses a variable-length encoding that ranges from 1 to 4 bytes per character. As you've already found, flipping random bits can result in invalid encodings. Take a look at

https://en.wikipedia.org/wiki/UTF-8#Encoding
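To make this concrete, here is a minimal sketch (standard library only; flip_bit is a hypothetical helper, not part of any package) that flips a single bit in the second byte of a two-byte UTF-8 sequence. The damaged byte no longer looks like a continuation byte, so UTF-8 decoding fails, while the same bytes still decode under Latin-1.

```python
def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit flipped (hypothetical helper)."""
    b = bytearray(data)
    b[index] ^= 1 << bit
    return bytes(b)

original = '\u00e9'.encode('utf8')   # 'é' encodes to the two bytes C3 A9
damaged = flip_bit(original, 1, 6)   # A9 becomes E9, not a valid continuation byte

try:
    damaged.decode('utf8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 accepts any byte value, so this always succeeds.
as_latin1 = damaged.decode('latin1')

print(original, damaged, utf8_ok, len(as_latin1))
```

The practical consequence is that corrupted data should stay as bytes (or Latin-1 text) until after error correction, and only then be decoded as UTF-8.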

Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages, it could use 10-bit, 12-bit, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or an uncorrectable error detected) via Reed-Solomon before attempting to convert the bit string back to UTF-8.
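The packing step described above could look roughly like the sketch below. It covers only the conversion between UTF-8 bytes and fixed-size elements with zero padding; the Reed-Solomon encoding itself is not shown, and the function names bytes_to_symbols and symbols_to_bytes are hypothetical, not part of unireedsolomon.

```python
def bytes_to_symbols(data: bytes, symbol_bits: int) -> list:
    """Split data into symbol_bits-wide integers, zero-padding the last one."""
    total_bits = len(data) * 8
    pad = (-total_bits) % symbol_bits          # zero bits needed to reach a boundary
    value = int.from_bytes(data, 'big') << pad
    count = (total_bits + pad) // symbol_bits
    mask = (1 << symbol_bits) - 1
    # Extract symbols from most significant to least significant.
    return [(value >> (symbol_bits * (count - 1 - i))) & mask for i in range(count)]

def symbols_to_bytes(symbols: list, symbol_bits: int, byte_len: int) -> bytes:
    """Inverse of bytes_to_symbols; byte_len is the original message length."""
    value = 0
    for s in symbols:
        value = (value << symbol_bits) | s
    pad = len(symbols) * symbol_bits - byte_len * 8
    return (value >> pad).to_bytes(byte_len, 'big')

message = 'hello(你好)'.encode('utf8')       # 13 bytes = 104 bits
symbols = bytes_to_symbols(message, 10)      # 104 bits + 6 pad bits = 11 symbols
restored = symbols_to_bytes(symbols, 10, len(message))
print(len(symbols), restored.decode('utf8'))
```

In this arrangement any bit errors are applied to the symbols, the Reed-Solomon decoder corrects them, and only the corrected symbols are converted back to bytes and decoded as UTF-8.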