In Python, I am using regular expressions to retrieve strings from a dictionary which show a specific pattern, such as having some repetitions of characters than a specific character and another repetitive part (e.g. ^(\w{0,2})o(\w{0,2})$
).
This works as expected, but now I'd like to split the string in two substrings (eventually one might be empty) using the central character as delimiter. The issue I am having stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).
Both re.search().groups()
and re.findall()
did not solve this issue, and the docs on the re module seems to point out that overlapping matches would not be returned by the methods.
Here is a snippet showing the undesired behaviour:
import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile('^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
matches = regex.findall(word)
if matches:
halves.append(matches)
I am posting this as an answer mainly not to leave the question answered in the case someone stumbles here in the future and since I've managed to reach the desired behaviour, albeit probably not in a very pythonic way, this might be useful as a starting point from someone else. Some notes on how improve this answer (i.e. making more "pythonic" or simply more efficient would be very welcomed).
The only way of getting all the possible splits of the words having length in a certain range and a character in certain range of positions, using the characters in the "legal" positions as delimiters, both using the
re
and the newregex
modules involves using multiple regexes. This snippet allows to create at runtime an appropriate regex knowing the length range of the word, the char to be seek and the range of possible positions of such character.The output is:
At this point I might start considering using a regex just to find the to words to be split and the doing the splitting in 'dumb way' just checking if the characters in the range positions are equal
char
. Anyhow, any remark is extremely appreciated.