Trouble decoding malformed bytes to integer

I have a simple Python socket server that receives "command" codes encoded in ASCII. Most messages decode properly with data.decode("utf-8"), but some fail, and falling back to latin-1 turns them into seemingly random characters.

Here are two examples:

byte_string1 = b'\xa3\xb67'  # When client sends 67
byte_string2 = b'\xa3\xb6\xa3\xb6' #When client sends 66

I can see the numbers 67 and 6-6 in the input, but I have been unable to extract them. Is there a proper way to handle these?

My current attempt is below; I expect to get strings back from the bytes in data:

def get_command(data):
    try:
        command = data.decode("utf-8")
    except UnicodeDecodeError as err1:
        logger.debug(f"utf-8 UnicodeDecodeError: {err1} for data: {data}")
        try:
            command = data.decode("latin-1")
        except UnicodeDecodeError as err2:
            logger.debug(f"latin-1 UnicodeDecodeError: {err2} for data: {data}")
            logger.debug(
                f"Taking a guess that the bytes are integers, for data: {data}"
            )
            command = [b for b in data]
    return command

server_ip = '0.0.0.0'
server_port = 1234

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((server_ip, server_port))
server_socket.listen(5)
while True:
    data = client_socket.recv(1024)
    if not data:
        break

    command = get_command(data)

There is 1 best solution below

Your issue is that you're trying to decode a custom byte encoding using standard decoders like UTF-8 and Latin-1. If the byte strings have a specific structure, you should extract the relevant parts manually.
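
To see why the standard decoders do not help, here is a small check on the first sample from the question. It is only an illustration; the variable names are mine. Note that latin-1 maps every possible byte to a character, so it never raises UnicodeDecodeError, which means the innermost except branch in the question can never run.

```python
sample = b'\xa3\xb67'  # First example from the question

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xa3 is not a valid first byte of a UTF-8 sequence

# latin-1 accepts all 256 byte values, so this call cannot fail
latin1_text = sample.decode("latin-1")

print(utf8_ok)      # False
print(latin1_text)  # £¶7
```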

In your case, it appears that the command bytes are encoded in the last part of the byte string. You can slice the byte string to get the relevant bytes.

Here's a simplified version of get_command():

def get_command(data):
    command_bytes = data[2:]  # Skipping first two bytes
    try:
        command = command_bytes.decode("utf-8")
    except UnicodeDecodeError:
        command = [b for b in command_bytes]
    return command

The above function assumes that the first two bytes of every message are a fixed prefix that carries no command data. Check that assumption against your real traffic before relying on it.
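
A quick way to test that assumption is to run the function on the two samples from the question. The function is repeated here only so the snippet runs on its own:

```python
def get_command(data):
    command_bytes = data[2:]  # Skipping first two bytes
    try:
        command = command_bytes.decode("utf-8")
    except UnicodeDecodeError:
        command = [b for b in command_bytes]
    return command

first = get_command(b'\xa3\xb67')          # Client sent 67
second = get_command(b'\xa3\xb6\xa3\xb6')  # Client sent 66

print(first)   # 7
print(second)  # [163, 182]
```

For these two samples the slice drops the 6 carried by 0xb6 and leaves the second message undecoded, so a fixed two-byte skip only fits if your other messages really do start with two throwaway bytes.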

Update your main loop to incorporate this:

import socket

server_ip = '0.0.0.0'
server_port = 1234

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((server_ip, server_port))
server_socket.listen(5)

while True:
    client_socket, _ = server_socket.accept()
    with client_socket:  # Close the connection once it has been handled
        data = client_socket.recv(1024)
        if not data:
            continue  # Client closed without sending; wait for the next one

        command = get_command(data)

This should solve your problem as long as the two-byte prefix assumption holds.


Alternatively, if the high bit of a byte is used to indicate a new header, you can scan through the byte string to detect these headers and then process the payload bytes accordingly.

Here's a function to do that:

def get_commands(data):
    commands = []
    i = 0
    while i < len(data):
        if data[i] != 0xa3:  # Not a header byte: skip it so the loop cannot stall
            i += 1
            continue
        i += 1  # Move past the header
        if i >= len(data):
            break  # Truncated message: header with nothing after it

        # The byte after the header has its high bit set; mask it off
        command = bytes([data[i] & 0x7F])
        i += 1

        # Following bytes without the high bit belong to the same command
        while i < len(data) and data[i] & 0x80 == 0:
            command += bytes([data[i]])
            i += 1
        commands.append(command.decode("ascii"))
    return commands

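This approach assumes that every command starts with the header byte 0xa3, that the byte after the header carries its value in the low seven bits, and that any following bytes without the high bit set belong to the same command. Modify as needed.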
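
If the high bit turns out to be nothing more than a flag on otherwise plain ASCII, a simpler sketch is to clear it on every byte and decode the result. This is an assumption about the protocol, not something the samples prove, and strip_high_bit is a hypothetical helper name. In the two samples, 0xa3 & 0x7F is 0x23 ('#') and 0xb6 & 0x7F is 0x36 ('6'):

```python
def strip_high_bit(data: bytes) -> str:
    """Clear bit 7 of every byte and decode the result as ASCII."""
    # Assumption: the low seven bits of each byte are an ASCII character
    return bytes(b & 0x7F for b in data).decode("ascii")

print(strip_high_bit(b'\xa3\xb67'))         # #67
print(strip_high_bit(b'\xa3\xb6\xa3\xb6'))  # #6#6
```

Whether '#' is a real delimiter in your protocol is a guess; if it is, splitting the decoded string on it would separate the individual commands.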