'UTF-8' decoding error while using unireedsolomon package

Question

211 Views Asked by Sid At 17 June 2022 at 07:29

I have been writing code using the unireedsolomon package. The package adds parity bytes, which are mostly extended ASCII characters. I apply bit-level errors after converting the text, including these 'special character' parity symbols, to a bit string with the following code:

def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8  # round up to whole bytes
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char

After converting the string to a bit stream, I apply errors at random. In some cases I cannot convert the result back to those special characters (or any characters): I hit the 'utf-8' decoding error before I can even decode the message.

If I flip a bit, the result should still fall within the 0-255 extended ASCII character values, yet I am getting errors. Is there any way to fix this?

There are 2 best solutions below

rcgldr On 17 June 2022 at 10:12

UTF-8 uses a variable-length encoding that ranges from 1 to 4 bytes per character. As you've already found, flipping random bits can result in invalid encodings. Take a look at

https://en.wikipedia.org/wiki/UTF-8#Encoding
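To make the failure concrete, here is a minimal sketch using only the standard library. The helper name `flip_bit` is made up for this illustration and is not part of unireedsolomon. It flips a single bit in the two-byte UTF-8 encoding of "é" so that the second byte stops being a valid continuation byte:

```python
def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit flipped (bit 0 = least significant)."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)

encoded = "é".encode("utf-8")      # U+00E9 takes two bytes: b'\xc3\xa9'
damaged = flip_bit(encoded, 1, 6)  # 0xA9 -> 0xE9, no longer a continuation byte

try:
    damaged.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(encoded, damaged, decode_failed)  # b'\xc3\xa9' b'\xc3\xe9' True
```

This is why any character above U+007F is fragile here: it occupies more than one byte in UTF-8, and the bytes must follow a fixed lead/continuation pattern.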

Reed-Solomon normally uses fixed-size elements, in this case probably 8-bit elements, in a bit string. For longer messages, it could use 10-bit, 12-bit, or 16-bit elements. It would make more sense to convert the UTF-8 message into a bit string, zero-padded to an element boundary, and then perform Reed-Solomon encoding to append parity elements to the bit string. When reading, the bit string should be corrected (or flagged as uncorrectable) via Reed-Solomon before attempting to convert the bit string back to UTF-8.
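A rough sketch of the bit-string handling described above, assuming 8-bit elements. The function names are hypothetical and the Reed-Solomon step itself is left out. The point is that every byte is rendered as exactly 8 bits, and that the noisy bit string is turned back into bytes, not text, so nothing is UTF-8 decoded until after error correction:

```python
def bytes_to_bits(data: bytes) -> str:
    """Render each byte as exactly 8 bits so leading zeros are preserved."""
    return "".join(format(byte, "08b") for byte in data)

def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits; zero-pads on the right to a byte boundary."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def flip(bits: str, position: int) -> str:
    """Flip the bit at the given position of a '0'/'1' string."""
    flipped = "1" if bits[position] == "0" else "0"
    return bits[:position] + flipped + bits[position + 1:]

payload = "hi\u00e9".encode("utf-8")    # b'hi\xc3\xa9'
bits = bytes_to_bits(payload)           # 32 characters of '0'/'1'
noisy = bits_to_bytes(flip(bits, 17))   # still plain bytes, no decoding attempted
print(noisy)
```

The `noisy` bytes are what would be handed to the Reed-Solomon decoder. Calling `.decode("utf-8")` belongs after that step, on the corrected bytes.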

Mark Tolonen (accepted answer) On 21 June 2022 at 00:34

It's a bit odd that the error-correction package works with Unicode strings. It is better to encode byte data, since the data being encoded/decoded may not be only text. There is also no need to work with actual binary strings (Unicode 1s and 0s); flip bits in the byte strings instead.

Below I've wrapped the encode/decode routines so they take either Unicode text and return byte strings or vice versa. There is also a corrupt function that will flip bits in the encoded result to see the error correction in action:

import unireedsolomon as rs
import random

def corrupt(encoded):
    '''Flip up to 3 bits (might pick the same bit more than once).
    '''
    b = bytearray(encoded) # convert to writable bytes
    for _ in range(3):
        index = random.randrange(len(b)) # pick random byte
        bit = random.randrange(8)        # pick random bit
        b[index] ^= 1 << bit             # flip it
    return bytes(b) # back to read-only bytes, but not necessary

def encode(coder,msg):
    '''Convert the msg to UTF-8-encoded bytes and encode with "coder".  Return as bytes.
    '''
    return coder.encode(msg.encode('utf8')).encode('latin1')

def decode(coder,encoded):
    '''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
    '''
    return coder.decode(encoded)[0].encode('latin1').decode('utf8')

coder = rs.RSCoder(20,13)
msg = 'hello(你好)'  # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
encoded = encode(coder,msg)
print(encoded)
corrupted = corrupt(encoded)
print(corrupted)
decoded = decode(coder,corrupted)
print(decoded)

Output. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. You can actually change any 3 bytes at random (not just bits), including error-correcting bytes, and the message will still decode: RSCoder(20,13) appends 7 parity bytes, which is enough to correct up to 3 corrupted bytes.

b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
hello(你好)

A note about .encode('latin1'): The encoder is using Unicode strings and the Unicode code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec will convert a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0-255.
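The 1:1 mapping can be checked with the standard library alone. This short sketch (the variable names are only for illustration) round-trips every byte value through 'latin1' and, for contrast, shows that UTF-8 would not preserve the length:

```python
all_bytes = bytes(range(256))

# Decoding with latin1 maps byte value N to code point U+00NN.
as_text = all_bytes.decode("latin1")
code_points = [ord(ch) for ch in as_text]

# Encoding back with latin1 restores the original bytes exactly.
round_trip = as_text.encode("latin1")

# UTF-8 uses two bytes for each code point from U+0080 to U+00FF.
utf8_len = len(as_text.encode("utf-8"))

print(code_points == list(range(256)), round_trip == all_bytes, utf8_len)  # True True 384
```

This is also why calling `.encode()` with the default UTF-8 codec on the coder's output, as in the question, changes the byte layout the error-correcting code was computed over.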
