Using regex to create a list of dictionaries with positive lookbehind


I am trying to create a list of dictionaries using a regex with a positive lookbehind. I tried two variations:

Variation 1

import re

string = '146.204.224.152 - lubo233'

for item in re.finditer(r"(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Variation 2

import re

string = '146.204.224.152 - lubo233'

for item in re.finditer(r"(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?<= - )(?P<user_name>[a-z]*[0-9]*)", string):
    print(item.groupdict())

Desired Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}

Question/Issue

In both cases, I cannot get rid of the " - " substring between the host and the user name.

Adding the positive lookbehind (?<= - ) breaks the pattern.

Can anyone help me identify my mistake? Thanks.


There are 2 best solutions below

Answer by azro

I'd suggest removing the positive lookbehind and simply putting the separator, as a literal, between the two groups.

Some further improvements:

  • \. instead of [.]

  • [0-9]{1,3} instead of [0-9]* (each octet has one to three digits)

  • (?:\.[0-9]{1,3}){3} instead of \.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

Put a .* before the " - " separator to skip any extra words that may appear between the host and the user name:

import re

rgx = re.compile(r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)")

vals = ['146.204.224.152 aw0123 abc - lubo233',
        '146.204.224.152 as003443af - lubo233',
        '146.204.224.152 - lubo233']

for val in vals:
    for item in rgx.finditer(val):
        print(item.groupdict())

# Output:
# {'host': '146.204.224.152', 'user_name': 'lubo233'}
# {'host': '146.204.224.152', 'user_name': 'lubo233'}
# {'host': '146.204.224.152', 'user_name': 'lubo233'}
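
To get the list of dictionaries the question asks for, the matches can be collected with a list comprehension. This is only a minimal sketch, assuming one log entry per string; the names LOG_RGX and parse_lines and the second sample line are invented for illustration, and the pattern uses {1,3} digits per octet so that empty octets are rejected.

```python
import re

# Hypothetical names; the pattern follows this answer, with 1 to 3 digits per octet.
LOG_RGX = re.compile(
    r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)"
)


def parse_lines(lines):
    """Return one dict per match; lines that do not match are skipped."""
    return [m.groupdict() for line in lines for m in LOG_RGX.finditer(line)]


records = parse_lines([
    "146.204.224.152 - lubo233",
    "10.0.0.7 extra words - anna42",  # invented sample line
    "not a log line",                 # no IP address, so it is skipped
])
print(records)
```

Each dictionary has the keys host and user_name, taken from the named groups.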

Answer by Dani Mesejo

The positive lookbehind does not work because the pattern asks for:

  • (?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*), an IP address,
  • immediately followed by a user name, (?P<user_name>(?<= - )[a-z]*[0-9]*), which the lookbehind (?<= - ) requires to be preceded by " - ".

A lookbehind is a zero-width assertion: it checks the text before the current position but consumes nothing. Once the regex engine has consumed the IP address, the lookbehind is tested at that position, where the preceding text is the end of the IP address rather than " - ". In other words, once the IP pattern has matched, the remaining string is:

' - lubo233'

The pattern that has to match right at that point, as with re.match, is:

(?P<user_name>(?<= - )[a-z]*[0-9]*)

which cannot match, because nothing in the pattern consumes the " - ". To illustrate, the following pattern works because it consumes the separator before the user name:

import re

string = '146.204.224.152 - lubo233'
for item in re.finditer(r"((?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)( - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
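
The difference can be checked directly. The following is a minimal sketch that uses only the sample string from the question; the variable names are arbitrary. Without anything consuming " - ", the pattern finds no match at all, while consuming the separator first lets the lookbehind succeed.

```python
import re

string = '146.204.224.152 - lubo233'
ip = r"(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)"
user = r"(?P<user_name>(?<= - )[a-z]*[0-9]*)"

# Lookbehind only: it is tested right after the IP part, where the
# preceding characters are digits or a dot, never " - ".
no_separator = [m.groupdict() for m in re.finditer(ip + user, string)]

# Separator consumed first: the lookbehind now sees " - " behind it.
with_separator = [m.groupdict() for m in re.finditer(ip + " - " + user, string)]

print(no_separator)    # []
print(with_separator)  # [{'host': '146.204.224.152', 'user_name': 'lubo233'}]
```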

If you need to match an arbitrary number of characters between the two patterns, you could do:

import re

string = '146.204.224.152 adfadfa - lubo233'
# Four octets of 1 to 3 digits each, so the whole address ends up in "host".
for item in re.finditer(r"((?P<host>\d{1,3}(?:[.]\d{1,3}){3})(.* - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
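
Since the pattern consumes the separator before the user_name group, the lookbehind no longer does any work and can be dropped. This is a minimal sketch under that observation, assuming a four-octet address with 1 to 3 digits per octet; pattern, sample and result are illustrative names.

```python
import re

pattern = re.compile(
    r"(?P<host>\d{1,3}(?:[.]\d{1,3}){3})"  # four dot-separated octets
    r".* - "                                # any text, then the separator
    r"(?P<user_name>[a-z]*[0-9]*)"          # letters followed by digits
)

sample = '146.204.224.152 adfadfa - lubo233'
result = [m.groupdict() for m in pattern.finditer(sample)]
print(result)
```

The separator is matched as ordinary text outside the named groups, so it never appears in the resulting dictionary.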