Is it possible to write a grep -P (PCRE) command that prints the lines containing only A and B such that there are exactly n A's followed by exactly n B's and no other characters. Such that these are valid matches:
AB
AAABBB
AAAAAAABBBBBBB
AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBB
while these are not:
AAABB
ABBBBBB
BBBA
ABABA
BBBBBBBB
With normal regular expressions, you can't do this - they can only match regular context-free languages (Type 3 in the Chomsky hierarchy of languages), while what you want to match is a classic example of a type 2 language.
Luckily,
perlregular expressions aren't very regular in the formal language theory sense. You can match this using a recursive regular expression:(Where
input.txtcontains all your test cases).This matches either an empty string (0 A's followed by 0 B's), or a string starting with A, a successful recursive match of the pattern against the rest of the string minus the first and last characters, and ending with a B. If a B appears before an A, an A after a B, or the total number of A's and B's don't match, it thus fails.
(?>regex)is an optimization that prevents backtracking after a match failure.If you want to enforce
n >= 1, a slight variation to lift one pair of A and B outside of the recursive section:^A((?>A(?1)B|))B$.