Matching unique groups while maintaining their order

1.1k Views Asked by At

Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor, as well.

Here's a sample case:

string = 'the floodwaters are rising along the coast'
unique = ['the', 'floadwaters', 'are', 'rising', 'along', 'coast']

In a Python-regex hybrid solution I could match the groups I want, and use a list comprehension to remove the duplicates while maintaining the order.

groups = re.findall('[a-zA-Z]+', string)
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]

There are similar questions across the site, such as one that addresses matching unique words. The expression from the accepted answer, however, matches the furthest right occurence of a given group, while I want to match the first occurence. Here's that expression:

(\w+\b)(?!.*\1\b)
1

There are 1 best solutions below

0
On BEST ANSWER

A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.

However, a regex solution like this should only be considered for use when the input is relatively short: more than 100 words in an input string will make it very slow due to backtracking that is inevitable in this case. Thus, for a mere learning purpose, I will share the regex that is only supported in .NET and Python PyPi regex library (it is also possible to do in Vim as its lookbehind is also infinite-width, but I guess there are even simpler ways with that powerful tool).

(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)

See the regex demo

The (?s) part is an inline modifier that makes . match all line breaks. You may use regex.DOTALL in Python regex.

Details

  • \b - initial word boundary
  • (\w+) - Group 1: one or more word chars
  • \b - trailing word boundary
  • (?<!^.*\b\1\b.*\b\1\b) - an infinite width negative lookbehind that fails the match if the word matched into Group 1 happens to appear at least once before itself, i.e. if, immediately to the left of the current location (that is right after the word captured), there is a sequence of patterns:
    • ^ - start of string
    • .*\b\1\b - any zero or more chars, as many as possible and then the same value as in Group 1 as a whole word
    • .*\b\1\b - same as above (needed to match the captured word, since the lookbehind is used after the consumed word)

The .* in the lookbehind causes lots of backtracking, and the pattern will work rather slow in general, and very slow with large inputs and eventually might cause time outs.