How can I parse sequence of binary digits in python. Following is an example for what i am trying to do.
I have a sequence of binary digits, for example
sequence = '1110110100110111011011110101100101100'
and, I need to parse this and extract the data.
Say the above sequence contains start, id, data and end fields
start is a 2 bit field, id is an 8 bit field, data field can vary from 1 to 8192 bits and end is a 4 bit field.
and after parsing I'm expecting the output as follows:
result = {start : 11,
id : 10110100,
data : 11011101101111010110010,
end : 1100,
}
I'm using this in one of my applications. I'm able to parse the sequence using regex but, the problem is regex must be written by the user. So as an alternative i'm using BNF grammar as grammars are more readable.
I tried solving this using python's parsimonious and pyparsing parsers. But am not able to find the solution for the fields with variable length.
The grammar I wrote in parsimonious
available for python is as follows:
grammar = """sequence = start id data end
start = ~"[01]{2}"
id = ~"[01]{8}"
data = ~"[01]{1,8192}"
end = ~"[01]{4}"
"""
Since the data field is of variable length, and the parser is greedy, the above sequence is not able to match with the above grammar. The parser takes end field bits into the data field.
I just simplified my problem to above example.
Let me describe the full problem. There are 3 kinds of packets (lets call them Token, Handshake and Data packets). Token and Handshake packets are of a fixed length and Data packet is variable length. (The example above shown is an example for data packet)
The input consists of a continuous stream of bits. Each packet beginning is marked by the "start" pattern and packet end is marked by the "end" pattern. Both of these are fixed bit patterns.
Example Token packet grammar:
start - 2 bits, id - 8 bits, address - 7bits, end - 4bits
111011010011011101100
Example Handshake packet grammar:
start - 2 bits, id - 8bits, end - 4 bits
11101101001100
Example top level rule:
packet = tokenpacket | datapacket | handshakepacket
If there were only one type of packet then slicing would work. But when we start parsing, we do not know which packet we will finally end up matching. This is why I thought of using a grammar as the problem is very similar to language parsing.
Can we make the slicing approach work in this case where we have 3 different packet types to be parsed?
Whats the best way to solve this problem?
Thanks in advance,
This will do, just use slicing for this job:
You will get the correct data from the string.