GNU grep, backreferences and wildcards

74 Views Asked by SlowFox At 26 May 2022 at 07:33

Using grep (GNU grep 3.3) to search for all words with three consecutive double-letters (resulting in "bookkeeper"):

grep -E "((.)\2){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters, each of them followed by the letter "i" (resulting in "Mississippi"):

grep -E "((.)\2i){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters, each of them followed by any single letter (with a couple of results):

grep -E "((.)\2.){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters separated by an optional, single letter (even more results):

grep -E "((.)\2.?){3}" /usr/share/dict/american-english

Now, finally, my original task: Search for all words containing three double-letters:

grep -E "((.)\2.*){3}" /usr/share/dict/american-english

But this results in an empty set. Why? How can .? match something .* does not?

There are 1 best solutions below

Wiktor Stribiżew On 26 May 2022 at 08:56 BEST ANSWER

The POSIX regex engine does not handle patterns with back-references well, matching back references is an NP complete problem might provide some hints on why it is that difficult.

Since you are using a GNU grep, the problem is easily solved with the PCRE engine,

grep -P '((.)\2.*){3}' file

since the PCRE engine can handle back-references in a more efficient way than the POSIX regex engine.