I ran a few tests to help myself understand non-greedy matching in Python, but they left me more confused than before. Thank you for the help!
import re
lan = 'From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc fff@ddd eee'
print(re.findall(r'\S+@\S+?', lan))   # 1
print(re.findall(r'\S+@\S+', lan))    # 2
print(re.findall(r'\S+?@\S+?', lan))  # 3
print(re.findall(r'\S+?@\S+', lan))   # 4
Result:
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@c', 'fff@d'] # 1
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc', 'fff@ddd'] # 2
['000@h', 'hhaaa@s', 'tephen.marquard@u', 'ct.ac.za@b', 'bb@c', 'fff@d'] # 3
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc', 'fff@ddd'] # 4
Questions:
- #1: Why does the result show only one d here (@d)?
- #2: This is normal, very clear.
- #3: Very confusing. I do not even know how to ask about the logic behind it, especially when compared with #1.
- #4: It seems to be the same as #2, so why is the ? before @ so 'weak'?
- #1: Because `+?` is not required to match more than once, so it doesn't.
- #3: Again, `+?` matches as many characters as it has to, as opposed to matching as many characters as it can, which is exactly the difference between greedy and non-greedy matching. On the example of `\S+?@\S+?` matching `From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc`:
  - The engine first tries `From`, but then it fails because there is a space.
  - It then matches `000`, then the `@` matches, then `\S+?` again matches as many `\S` as it has to. It has to match 1 character.
  - The first match is therefore `000@h`.
- #4: Explained above.
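To make the "as many as it has to" rule concrete, here is a minimal sketch that runs the four quantifier combinations on a shorter, hypothetical sample line (the variable names and the sample text are illustrative, not from the question). The last pattern adds a lookahead, which is also an addition for illustration: it shows that a lazy `\S+?` does grow when something after it forces it to.

```python
import re

# Hypothetical sample line, shorter than the one in the question.
text = 'From aaa@bbb@ccc fff@ddd eee'

patterns = {
    'greedy + lazy':   r'\S+@\S+?',
    'greedy + greedy': r'\S+@\S+',
    'lazy + lazy':     r'\S+?@\S+?',
    'lazy + greedy':   r'\S+?@\S+',
}

# Collect every non-overlapping match for each combination.
results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, found in results.items():
    print(f'{name:16} {patterns[name]:12} {found}')

# A trailing lazy quantifier stops after one character because nothing
# follows it. With a lookahead for whitespace after it, the lazy part
# has to keep expanding until the lookahead succeeds.
forced = re.findall(r'\S+?@\S+?(?=\s)', text)
print(forced)  # ['aaa@bbb@ccc', 'fff@ddd']
```

The leading `\S+?` still has to reach an `@`, so being lazy there only changes which `@` is hit first; it cannot shorten the match below that point.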
Since email addresses can't contain spaces, why bother with non-greedy matching anyway? You could use something as simple as `\S+@\S+`.
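As a rough sketch of that simpler approach, the helper below (the name `find_addresses` and the sample line are hypothetical) pulls out every whitespace-delimited token that has at least one non-space character on each side of an `@`. It extracts address-like tokens; it does not validate them.

```python
import re

def find_addresses(line):
    """Return tokens of the form <non-space>@<non-space> found in line."""
    # Both quantifiers are greedy, so each match spans the whole token.
    # Note: trailing punctuation attached to a token would be included.
    return re.findall(r'\S+@\S+', line)

found = find_addresses('From alice@example.com to bob@example.org today')
print(found)  # ['alice@example.com', 'bob@example.org']
```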