Removing whitespace while parsing with a regular expression


I am using regular expressions to parse a file for some patterns. However, if there is extra whitespace in the middle of my data, I end up with the wrong data. My data has this format:

067  000100 A
067  000200 B
067 000300  C
067  000400 D
067 000500  E
067  000600 F

I am trying to get the first two strings, the middle two digits of the second string, and the value, like this (in some cases the second string may have 7 digits, which is why it is acceptable here for the regex to take one extra character at the end):

('067 000100 ', '01', 'A')

I am using the following regular expression:

import re

qnum = r'067'
subq = r' .00'  # using . because I am not sure if there's one space or two!
fmt = r'(?sm)^(' + qnum + subq + r'(..)...)\s*(.*?)\s*$'
# data is a string with all those values, separated by \n
result = re.findall(fmt, data, re.I)

But I end up with the following:

('067  000100 ', '01', 'A')
('067  000200 ', '02', 'B')
('067 000300  ', '30', 'C')

How can I get the proper header so there's only "one space" in the middle and also the correct middle digits?

There are 3 answers below

Best answer

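. doesn't mean an optional character; it matches exactly one character, whatever that character is. That is why the line with a single space loses a digit: the . consumes the first 0 and every later group shifts by one. Instead of a space followed by ., you want \s+, which matches one or more whitespace characters.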
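
A minimal sketch of that fix, using the sample data from the question. The helper name parse_lines, the explicit \d{2,3} for the optional seventh digit, and the choice to rebuild the header with a single space are assumptions for illustration, not part of the original code:

```python
import re

data = """067  000100 A
067  000200 B
067 000300  C
067  000400 D
067 000500  E
067  000600 F
"""

# \s+ absorbs one or more whitespace characters between the fields,
# so the digit groups line up regardless of how many spaces there are.
# \d{2,3} allows the second field to have 6 or 7 digits (assumption).
PATTERN = re.compile(r'^(067)\s+(00(\d{2})\d{2,3})\s+(\S+)\s*$', re.M)

def parse_lines(text):
    """Return (header, middle_digits, value) tuples (hypothetical helper)."""
    results = []
    for m in PATTERN.finditer(text):
        qnum, number, middle, value = m.groups()
        # Rebuild the header so it always contains exactly one space.
        results.append((f"{qnum} {number}", middle, value))
    return results

for row in parse_lines(data):
    print(row)
```

Capturing the two fields separately and joining them afterwards is what gives a header with exactly one space; a single capture group around \s+ would keep whatever spacing the input had.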

Answer 2

You could try it this way:

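#!/usr/bin/env python3

import re

s = """
067  000100 A
067  000200 B
067 000300  C
067  000400 D
067 000500  E
067  000600 F
"""

# Header (both fields), middle two digits of the second field, then the value.
pattern = re.compile(r"(\d+\s+\d{2}(\d{2})\d{2})\s+(\S)")

for line in s.splitlines():
    if line.strip():  # skip blank lines
        m = pattern.match(line)
        if m:  # ignore lines that do not match the pattern
            print(m.groups())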

output:

('067  000100', '01', 'A')
('067  000200', '02', 'B')
('067 000300', '03', 'C')
('067  000400', '04', 'D')
('067 000500', '05', 'E')
('067  000600', '06', 'F')

Answer 3

How about

>>> subq = r'\s*00'
>>> fmt = r'(?sm)^(' + qnum + subq + r'(..)...)\s*(.*?)\s*$'
>>> result = re.findall(fmt,data, re.I)
>>> result
[('067  000600 ', '06', 'F')]

Change made:

  • subq = r'\s*00': since you are not sure of the number of spaces, \s* is used so that it matches any number of whitespace characters, including none.
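
One thing to keep in mind is that \s* also accepts no whitespace at all, whereas \s+ requires at least one whitespace character. The following sketch compares the two; the sample strings and the names loose and strict are made up for illustration:

```python
import re

loose = re.compile(r'^067\s*00(\d{2})')   # zero or more whitespace characters
strict = re.compile(r'^067\s+00(\d{2})')  # one or more whitespace characters

samples = ["067 000300", "067  000100", "067000500"]

for s in samples:
    # Show whether each pattern accepts the sample string.
    print(repr(s), bool(loose.match(s)), bool(strict.match(s)))
```

Both patterns handle one or two spaces; only the \s* version also matches the string with the fields run together, which may or may not be what you want for your data.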