Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor, as well.
Here's a sample case:
string = 'the floodwaters are rising along the coast'
unique = ['the', 'floadwaters', 'are', 'rising', 'along', 'coast']
In a Python-regex hybrid solution I could match the groups I want, and use a list comprehension to remove the duplicates while maintaining the order.
groups = re.findall('[a-zA-Z]+', string)
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]
There are similar questions across the site, such as one that addresses matching unique words. The expression from the accepted answer, however, matches the furthest right occurence of a given group, while I want to match the first occurence. Here's that expression:
(\w+\b)(?!.*\1\b)
A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.
However, a regex solution like this should only be considered for use when the input is relatively short: more than 100 words in an input string will make it very slow due to backtracking that is inevitable in this case. Thus, for a mere learning purpose, I will share the regex that is only supported in .NET and Python PyPi
regexlibrary (it is also possible to do in Vim as its lookbehind is also infinite-width, but I guess there are even simpler ways with that powerful tool).See the regex demo
The
(?s)part is an inline modifier that makes.match all line breaks. You may useregex.DOTALLin Pythonregex.Details
\b- initial word boundary(\w+)- Group 1: one or more word chars\b- trailing word boundary(?<!^.*\b\1\b.*\b\1\b)- an infinite width negative lookbehind that fails the match if the word matched into Group 1 happens to appear at least once before itself, i.e. if, immediately to the left of the current location (that is right after the word captured), there is a sequence of patterns:^- start of string.*\b\1\b- any zero or more chars, as many as possible and then the same value as in Group 1 as a whole word.*\b\1\b- same as above (needed to match the captured word, since the lookbehind is used after the consumed word)The
.*in the lookbehind causes lots of backtracking, and the pattern will work rather slow in general, and very slow with large inputs and eventually might cause time outs.