Parse a sequence of binary digits


How can I parse a sequence of binary digits in Python? The following is an example of what I am trying to do.

I have a sequence of binary digits, for example

sequence =  '1110110100110111011011110101100101100'

and I need to parse it and extract the fields.

Say the above sequence contains start, id, data and end fields.

start is a 2-bit field, id is an 8-bit field, the data field can vary from 1 to 8192 bits, and end is a 4-bit field.

After parsing, I expect the following output:

result = {'start': '11',
          'id': '10110100',
          'data': '11011101101111010110010',
          'end': '1100',
         }

I'm using this in one of my applications. I'm able to parse the sequence using a regex, but the problem is that the regex must be written by the user. As an alternative I'm using a BNF grammar, since grammars are more readable.

I tried solving this using Python's parsimonious and pyparsing parsers, but I am not able to find a solution for the fields with variable length.

The grammar I wrote using parsimonious is as follows:

grammar = """sequence = start id data end

start = ~"[01]{2}"
id = ~"[01]{8}"
data = ~"[01]{1,8192}"
end = ~"[01]{4}"
"""

Since the data field is of variable length and the parser is greedy, the above sequence fails to match the grammar: the parser consumes the end-field bits into the data field.
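For the single-packet case, the greediness can be resolved by making the data quantifier lazy and anchoring the match at the end of the input, so the variable-length field is forced to leave exactly 4 bits for the end field. A minimal sketch with Python's re module (shown only for comparison, not the grammar approach):

```python
import re

sequence = '1110110100110111011011110101100101100'

# Lazy {1,8192}? plus fullmatch: the data group expands only as far
# as needed so that exactly 4 bits remain for the end field.
pattern = re.compile(r'([01]{2})([01]{8})([01]{1,8192}?)([01]{4})')
m = pattern.fullmatch(sequence)
start, id_, data, end = m.groups()
# start == '11', id_ == '10110100', end == '1100'
```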

I just simplified my problem to the above example.

Let me describe the full problem. There are 3 kinds of packets (let's call them Token, Handshake and Data packets). Token and Handshake packets are of fixed length, and Data packets are of variable length. (The example above is a data packet.)

The input consists of a continuous stream of bits. The beginning of each packet is marked by the "start" pattern and the end by the "end" pattern; both are fixed bit patterns.

Example Token packet grammar:

start - 2 bits, id - 8 bits, address - 7 bits, end - 4 bits
111011010011011101100

Example Handshake packet grammar:

start - 2 bits, id - 8 bits, end - 4 bits
11101101001100

Example top level rule:

packet = tokenpacket | datapacket | handshakepacket

If there were only one type of packet, slicing would work. But when we start parsing, we do not know which packet type we will end up matching. This is why I thought of using a grammar, as the problem is very similar to language parsing.

Can we make the slicing approach work in this case, where there are 3 different packet types to be parsed?
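In other words, what I have in mind is something like a length-based dispatch once a packet has been isolated from the stream (a sketch using the example field widths above; `classify` is just an illustrative name):

```python
def classify(packet):
    # Widths from the examples: start=2, id=8, token address=7, end=4,
    # so a handshake is 14 bits, a token 21 bits, a data packet longer.
    if len(packet) == 14:
        return 'handshake', {'start': packet[:2], 'id': packet[2:10],
                             'end': packet[-4:]}
    if len(packet) == 21:
        return 'token', {'start': packet[:2], 'id': packet[2:10],
                         'address': packet[10:17], 'end': packet[-4:]}
    return 'data', {'start': packet[:2], 'id': packet[2:10],
                    'data': packet[10:-4], 'end': packet[-4:]}
```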

What's the best way to solve this problem?

Thanks in advance,

3 Answers

Answer 1:

This will do; just use slicing for this job:

def binParser(data):
    result = {}
    result["start"] = data[:2]    # first 2 bits
    result["id"] = data[2:10]     # next 8 bits
    result["end"] = data[-4:]     # last 4 bits
    result["data"] = data[10:-4]  # everything in between
    return result

This extracts the correct fields from the string.

Answer 2:

Presumably there will only ever be one variable-length field, so you can handle this by defining each field as an offset from the start of the sequence or from its end, e.g.

rules = {'start': (None, 2), 'id': (2, 10), 
         'data': (10, -4), 'end': (-4, None)}

and then use slicing:

sequence =  '1110110100110111011011110101100101100'

result = {k: sequence[v[0]:v[1]] for k, v in rules.items()}

This gives:

result == {'id': '10110100', 
           'end': '1100', 
           'data': '11011101101111010110010', 
           'start': '11'}
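A small follow-on, in case the fields are needed as numbers rather than bit strings: each slice can be converted with int(bits, 2) (a sketch reusing the rules table above):

```python
sequence = '1110110100110111011011110101100101100'
rules = {'start': (None, 2), 'id': (2, 10),
         'data': (10, -4), 'end': (-4, None)}

fields = {k: sequence[v[0]:v[1]] for k, v in rules.items()}
values = {k: int(v, 2) for k, v in fields.items()}
# e.g. values['id'] == 180 and values['end'] == 12
```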
Answer 3:

Since you mentioned pyparsing in the tags, here is how I would go about it using pyparsing. This uses Daniel Sanchez's binParser for post-processing.

from pyparsing import Word

# Post-processing of the data.
def binParser(m):
    data = m[0]
    return {'start': data[:2],
            'id': data[2:10],
            'end': data[-4:],
            'data': data[10:-4]}

# At least 14 characters for the required fields; attach the processor.
bin_sequence = Word('01', min=14).setParseAction(binParser)


sequence = '1110110100110111011011110101100101100'
print(bin_sequence.parseString(sequence)[0])

This could then be used as part of a larger parser.