Git log grep: How to match commit-message substrings regardless of word-order?


I tried this:

git log -i --all --grep='/(?=.*fix)(?=.*a)(?=.*bug)/'

but it did not work.


There are several issues:

  • With --grep, the pattern is passed as a plain string, not wrapped in regex delimiters such as /.../
  • The lookaheads in your pattern are PCRE syntax, which may not work because the default regex engine for git log --grep is the POSIX BRE flavor
  • Your pattern requires fix, a and bug to all be present on the same line, in any order. If you instead want to match any one of the strings, you need an alternation pattern, e.g. a|b|c. POSIX BRE has no alternation operator, although the POSIX extension available in GNU tools accepts \| as an alternation operator.

So, if you want to match entries that contain all 3 words in any order, you need to remove the regex delimiters and enable the PCRE regex engine:

git log -i -P --all --grep='^(?=.*fix)(?=.*a)(?=.*bug)'

Note the -P option, which enables the PCRE regex engine. Also, mind what the documentation says:

-P
--perl-regexp
Consider the limiting patterns to be Perl-compatible regular expressions.

Support for these types of regular expressions is an optional compile-time dependency. If Git wasn’t compiled with support for them providing this option will cause it to die.
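
If your Git build lacks PCRE support, one possible fallback is to drop the lookaheads and pass one --grep per word together with --all-match, which limits the output to commits matching every --grep pattern. The sketch below is only an illustration: it builds a throwaway repository with made-up commit messages, and the commit helper is a hypothetical convenience function. Unlike the lookahead pattern, --all-match does not require the words to sit on the same line of the message.

```shell
#!/bin/sh
# Sketch: require every word with --all-match instead of PCRE lookaheads.
set -eu

repo=$(mktemp -d)                 # throwaway repository with demo data
trap 'rm -rf "$repo"' EXIT

git -C "$repo" init -q

# Hypothetical helper: create an empty commit with the given message.
commit() {
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

commit "fix a nasty bug"
commit "fix typo in README"
commit "Bug template: add a fix checklist"
commit "document a known bug"

# Each --grep is a plain substring pattern; --all-match keeps only the
# commits whose message matches all of them, whatever the word order.
both=$(git -C "$repo" log -i --all --all-match \
        --grep=fix --grep=a --grep=bug --format=%s)

printf '%s\n' "$both"
```

With the four demo messages above, only the first and third commits should be listed, since the other two lack either fix or bug.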

If you want to match entries with any of the words, you can use

git log -i -E --all --grep='fix|a|bug'

With the -E option, POSIX ERE syntax is enforced, and | is the alternation operator in this regex flavor.
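
To see the difference the flag makes, here is a minimal sketch that runs the same pattern with and without -E against a throwaway repository. The commit messages and the commit helper are made up for the demonstration, and it assumes grep.patternType is left at its default (basic).

```shell
#!/bin/sh
# Sketch: the same pattern under POSIX ERE (-E) and under the default BRE.
set -eu

repo=$(mktemp -d)                 # throwaway repository with demo data
trap 'rm -rf "$repo"' EXIT

git -C "$repo" init -q

# Hypothetical helper: create an empty commit with the given message.
commit() {
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

commit "fix crash on startup"
commit "update docs"
commit "work around a bug in the parser"

# With -E, | is alternation: commits mentioning fix OR bug are selected.
ere=$(git -C "$repo" log -i -E --all --grep='fix|bug' --format=%s)

# Without -E, | is an ordinary character in BRE, so the pattern only
# matches a literal "fix|bug" in the message.
bre=$(git -C "$repo" log -i --all --grep='fix|bug' --format=%s)

printf 'ERE matches:\n%s\n' "$ere"
printf 'BRE matches:\n%s\n' "$bre"
```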

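To match them as whole words, use \b or \</\> word boundaries (both are GNU extensions rather than part of POSIX ERE, so they depend on the regex library your Git build uses):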

git log -i -E --all --grep='\<(fix|a|bug)\>'
git log -i -E --all --grep='\b(fix|a|bug)\b'

NOTE for Windows users:

In Windows Git CMD or Windows console, ' must be replaced with ":

git log -i -P --all --grep="^(?=.*fix)(?=.*a)(?=.*bug)"
git log -i -E --all --grep="\b(fix|a|bug)\b"

"however having some issues using PCRE on Mac. I am trying to solve those and will update here"

Make sure to use Git 2.40 (Q1 2023) or later: the newer regex library on macOS stopped enabling GNU-like enhanced BRE, where '\(A\|B\)' works as alternation, unless explicitly asked with the REG_ENHANCED flag.

This is now fixed. (See git mailing-list discussion)
That way, git-grep would behave the same as grep(1) on each platform, which is consistent with the principle of least astonishment (POLA). On macOS, plain git grep should use enhanced basic REs.

"git grep" now can be compiled to do so, to retain the old behaviour.

See commit 54463d3 (08 Jan 2023) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5427bb4, 23 Jan 2023)

54463d32ef: use enhanced basic regular expressions on macOS

Reported-by: Marco Nenciarini
Suggested-by: Jeff King
Signed-off-by: René Scharfe

When 1819ad3 ("grep: fix multibyte regex handling under macOS", 2022-08-26, Git v2.39.0-rc0 -- merge listed in batch #1) started to use the native regex library instead of Git's own (compat/regex/), it lost support for alternation in basic regular expressions.

Bring it back by enabling the flag REG_ENHANCED on macOS when compiling basic regular expressions.
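
A quick way to check whether a given Git build treats \| as alternation in basic regular expressions is to probe it against a throwaway repository. This is only a sketch: the commit messages and the commit helper are made up, and the result depends on the platform and the regex library Git was compiled against, so no particular outcome is assumed here.

```shell
#!/bin/sh
# Sketch: probe whether BRE alternation (\|) works in this Git build.
set -eu

repo=$(mktemp -d)                 # throwaway repository with demo data
trap 'rm -rf "$repo"' EXIT

git -C "$repo" init -q

# Hypothetical helper: create an empty commit with the given message.
commit() {
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

commit "fix crash on startup"
commit "update docs"

# No -E and no -P: the pattern is compiled as a basic regular expression.
# If \| acts as alternation, the first commit matches via "fix".
hits=$(git -C "$repo" log --all --grep='fix\|bug' --format=%s)

if [ -n "$hits" ]; then
    status="supported"
else
    status="not supported"
fi

printf 'BRE alternation with \\|: %s\n' "$status"
```

If the probe reports "not supported", passing -E and writing the pattern as 'fix|bug' avoids relying on the extension.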


With Git 2.41 (Q2 2023), the userdiff regexp patterns for various filetypes that are built into the system have been updated to avoid triggering regexp errors from UTF-8 aware regex engines.

See commit be39144 (06 Apr 2023) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit cbfe844, 20 Apr 2023)

userdiff: support regexec(3) with multi-byte support

Reported-by: D. Ben Knoble
Reported-by: Eric Sunshine
Helped-by: Junio C Hamano
Signed-off-by: René Scharfe

Since 1819ad3 ("grep: fix multibyte regex handling under macOS", 2022-08-26, Git v2.39.0-rc0 -- merge listed in batch #1) we use the system library for all regular expression matching on macOS, not just for git grep.
It supports multi-byte strings and rejects invalid multi-byte characters.

This broke all built-in userdiff word regexes in UTF-8 locales because they all include such invalid bytes in expressions that are intended to match multi-byte characters without explicit support for that from the regex engine.

"|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word regexes to match a single non-space or multi-byte character.
The \xNN characters are invalid if interpreted as UTF-8 because they have their high bit set, which indicates they are part of a multi-byte character, but they are surrounded by single-byte characters.

Replace that expression with "|[^[:space:]]" if the regex engine supports multi-byte matching, as there is no need to have an explicit range for multi-byte characters then.
Check for that capability at runtime, because it depends on the locale and thus on environment variables.
Construct the full replacement expression at build time and just switch it in if necessary to avoid string manipulation and allocations at runtime.

Additionally the word regex for tex contains the expression "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range.
The best replacement with only valid characters that I can come up with is "([a-zA-Z0-9]|[^\x01-\x7f])+".
Unlike the original it matches NUL characters, though.
Assuming that tex files usually don't contain NUL this should be acceptable.