Regex : How to avoid matching a word in a string upon a condition

I have a problem excluding one special case from a regex match. I have already created an example at this LINK.

Suppose I have a list of sentences like this:

X-MAS TREE //this should be excluded because it already matches my dictionary word
BLA BLA TREE
XMAS TREE
X-MASTREE
X-TREE
X-MASTREE

I also have a dictionary in which X-MAS TREE has the synonyms XMAS TREE, X-MASTREE, X-TREE and TREE. I need to replace every synonym with the dictionary word.

How can I exclude X-MAS TREE itself? Every match of the regex is replaced with X-MAS TREE, so if I search with the keyword TREE I get an infinite loop, because X-MAS TREE itself contains TREE.

I have already tried many combinations, but none of them work:

\b(XMAS TREE|X\-MASTREE|X\-TREE|TREE|(?!X\-MAS TREE)\b
\b(XMAS TREE|X\-MASTREE|X\-TREE|(?!X\-MAS \s)TREE)\b
\b(XMAS TREE|X\-MASTREE|X\-TREE|((?!X\-MAS )|\w*)TREE)\b
\b(XMAS TREE|X\-MASTREE|X\-TREE|(?:(?!X\-MAS) )TREE)\b

EDIT

I need to keep the word boundaries, because I build the regex in my code inside a loop and reuse it for other dictionaries. That is why, for this case, I need a special condition that only changes the TREE part of the regex, without changing the structure of the code.

There are 2 best solutions below

0
On BEST ANSWER

If you want to match a whole word TREE that is not preceded by X-MAS and a whitespace, you may use a negative lookbehind (?<!X-MAS\\s) (or, to make sure the X-MAS is a whole word, (?<!\\bX-MAS\\s)):

String pat = "\\b(?<!X-MAS\\s)TREE\\b";

See the regex demo.

Also, if there can be more than 1 whitespace, say, from 1 up to 10, you may add a limiting quantifier {1,10} after \s to make sure more than 1 whitespace is still accounted for:

String pat = "\\b(?<!X-MAS\\s{1,10})TREE\\b";

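Here, if there are from 1 up to 10 whitespaces between X-MAS and TREE, the negative condition (the so-called constrained-width negative lookbehind) will still work. Java's regex engine generally expects a lookbehind to have a bounded maximum length, which is why a range such as {1,10} is used rather than \s+.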

See this Java demo.
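As a rough sketch of how the lookbehind could be dropped into the alternation from the question, the snippet below guards only the bare TREE alternative and leaves the other synonyms untouched. The class name SynonymNormalizer and the method normalize are hypothetical names chosen for illustration, not part of the original code.

```java
import java.util.regex.Pattern;

class SynonymNormalizer {
    // The three multi-character synonyms are plain alternatives. Only the bare
    // word TREE is guarded by a bounded negative lookbehind, so it is skipped
    // when it directly follows "X-MAS" plus 1 to 10 whitespace characters.
    private static final Pattern SYNONYMS = Pattern.compile(
            "\\b(XMAS TREE|X-MASTREE|X-TREE|(?<!X-MAS\\s{1,10})TREE)\\b");

    // Replaces every synonym with the dictionary word (hypothetical helper).
    static String normalize(String sentence) {
        return SYNONYMS.matcher(sentence).replaceAll("X-MAS TREE");
    }

    public static void main(String[] args) {
        String[] samples = {
            "X-MAS TREE", "BLA BLA TREE", "XMAS TREE", "X-MASTREE", "X-TREE"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

With this pattern, running normalize on an already normalized sentence leaves it unchanged, so applying the replacement repeatedly should no longer loop.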

5
On

You can try this:

^(?!X-MAS\s+TREE\s*)(?=.*TREE).*$

Explanation

  1. ^ asserts position at the start of a line
  2. Negative lookahead (?!X-MAS\s+TREE\s*) asserts that the line does not start with X-MAS, whitespace, and TREE
  3. \s+ matches one or more whitespace characters (\s is equal to [\r\n\t\f\v ])
  4. Positive lookahead (?=.*TREE) asserts that TREE occurs somewhere in the line; .* then consumes the line
  5. $ asserts position at the end of a line
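The steps above can be tried in Java with a small line filter. This is only a sketch: the class LineFilter and the method needsReplacement are made-up names, and the pattern only decides whether a whole line qualifies, it does not locate the synonym inside the line.

```java
import java.util.regex.Pattern;

class LineFilter {
    // Same regex as above, written as a Java string literal.
    private static final Pattern CANDIDATE =
            Pattern.compile("^(?!X-MAS\\s+TREE\\s*)(?=.*TREE).*$");

    // True when the line contains TREE and does not start with "X-MAS TREE".
    static boolean needsReplacement(String line) {
        return CANDIDATE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "X-MAS TREE", "BLA BLA TREE", "XMAS TREE", "X-MASTREE", "X-TREE"
        };
        for (String s : samples) {
            System.out.println(s + " : " + needsReplacement(s));
        }
    }
}
```

Because the negative lookahead is anchored at the start of the line, a line such as "MY X-MAS TREE" would still be reported as a candidate, so this approach fits best when each line holds a single term.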

To cover the structure from your comment, you can try a negative lookbehind:

\b.*(?<!X-MAS )TREE\b

Tried here
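One thing worth checking before using this pattern for replacement is how much text it actually matches. The sketch below (LookbehindSpan and matchedText are hypothetical names) prints the matched span: because of the leading .*, the match includes everything before TREE, so a replace call would overwrite that prefix as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSpan {
    private static final Pattern P =
            Pattern.compile("\\b.*(?<!X-MAS )TREE\\b");

    // Returns the text covered by the first match, or null if nothing matches.
    static String matchedText(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {"X-MAS TREE", "BLA BLA TREE", "X-MASTREE"};
        for (String s : samples) {
            System.out.println(s + " : " + matchedText(s));
        }
    }
}
```

If only the synonym itself should be replaced, dropping the .* and keeping the lookbehind directly before TREE avoids capturing the prefix.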