Matching unique groups while maintaining their order

Question

Matching unique groups while maintaining their order

1.1k Views Asked by Zach Gates At 06 June 2025 at 07:53

Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor, as well.

Here's a sample case:

string = 'the floodwaters are rising along the coast'
unique = ['the', 'floadwaters', 'are', 'rising', 'along', 'coast']

In a Python-regex hybrid solution I could match the groups I want, and use a list comprehension to remove the duplicates while maintaining the order.

groups = re.findall('[a-zA-Z]+', string)
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]

There are similar questions across the site, such as one that addresses matching unique words. The expression from the accepted answer, however, matches the furthest right occurence of a given group, while I want to match the first occurence. Here's that expression:

(\w+\b)(?!.*\1\b)

Original Q&A

There are 1 best solutions below

**Wiktor Stribiżew** · Accepted Answer

A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.

However, a regex solution like this should only be considered for use when the input is relatively short: more than 100 words in an input string will make it very slow due to backtracking that is inevitable in this case. Thus, for a mere learning purpose, I will share the regex that is only supported in .NET and Python PyPi regex library (it is also possible to do in Vim as its lookbehind is also infinite-width, but I guess there are even simpler ways with that powerful tool).

(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)

See the regex demo

The (?s) part is an inline modifier that makes . match all line breaks. You may use regex.DOTALL in Python regex.

Details

\b - initial word boundary
(\w+) - Group 1: one or more word chars
\b - trailing word boundary
(?<!^.*\b\1\b.*\b\1\b) - an infinite width negative lookbehind that fails the match if the word matched into Group 1 happens to appear at least once before itself, i.e. if, immediately to the left of the current location (that is right after the word captured), there is a sequence of patterns:
- ^ - start of string
- .*\b\1\b - any zero or more chars, as many as possible and then the same value as in Group 1 as a whole word
- .*\b\1\b - same as above (needed to match the captured word, since the lookbehind is used after the consumed word)

The .* in the lookbehind causes lots of backtracking, and the pattern will work rather slow in general, and very slow with large inputs and eventually might cause time outs.

Matching unique groups while maintaining their order

There are 1 best solutions below

Related Questions in REGEX

Related Questions in PERFORMANCE

Related Questions in REGEX-GROUP

Trending Questions

Popular # Hahtags

Popular Questions