Find and exclude html-tags as whole words in negative lookbehind with regex

40 Views Asked by Nixen85 At 02 February 2023 at 16:02

I'm trying to find all paragraphs in a text (in JavaScript/jQuery) that are not yet wrapped in one of a defined set of HTML tags:

p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe

My current regex (https://regex101.com/r/O4i2hP/1) already matches paragraphs and excludes the defined tags

/(.+?(?<![</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm

but I can't figure out how to match whole tags only.

The problem is:

The character class [</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>] matches a single character in the list </(p|h123456blockquteimgafr)> (case sensitive).

Thus, as you can see from the example, text that is wrapped in other tags, such as <strong>TEXT</strong>, is also excluded, because its line ends in >, which is one of the listed characters.
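A small sketch of the effect described above, runnable in Node.js or a browser console. The sample strings and the helper isMatched are made up for illustration; only the two patterns come from the question (with the slash escaped so they can be written as regex literals).

```javascript
// The character class from the question, on its own. Inside [...] the
// characters ( ) | are literals, so this is a set of single characters.
const tagCharClass = /[<\/(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]/;

console.log(tagCharClass.test("|")); // true
console.log(tagCharClass.test("q")); // true: "q" occurs in "blockquote"
console.log(tagCharClass.test("s")); // false: "s" is not in the set

// The question's full pattern: the lookbehind rejects any line whose last
// character is in that set, and ">" is in the set.
const questionPattern =
  /(.+?(?<![<\/(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm;

// Hypothetical helper for this demo: does the pattern match the line at all?
function isMatched(line) {
  return line.match(questionPattern) !== null;
}

console.log(isMatched("<strong>TEXT</strong>")); // false: line ends with ">"
console.log(isMatched("Plain text."));           // true: "." is not in the set
console.log(isMatched("A paragraph"));           // false: line ends with "h"
```

The last call shows that the problem is not limited to tags: any line ending in one of the listed letters or digits is rejected as well.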

I tried different things, such as word boundaries (\bword\b), but couldn't get it working. I hope you can help. Thanks!

There are 2 best solutions below

John Williams On 02 February 2023 at 17:49 BEST ANSWER

This will do it.

^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)+?>.*</\1>).*$
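As a rough illustration of how this negative lookahead could be used from JavaScript: the sketch below assumes the pattern is meant to have .* both between the tags and before the final $, and that it is applied with the g and m flags so that ^ and $ work per line. The sample text and variable names are invented for the demo.

```javascript
// Assumed reading of the answer's pattern, with ".*" in both places.
// "\1" refers back to the tag name captured after "<", so the closing
// tag must be the same whole tag name as the opening one.
const notWrapped =
  /^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)+?>.*<\/\1>).*$/gm;

// Hypothetical sample input: one block per line.
const sampleText = [
  "<h1>Title</h1>",
  "First paragraph.",
  "<strong>Bold</strong> text",
  "<p>Wrapped</p>",
].join("\n");

// With the g flag, match() returns every matching line, or null if none.
const unwrappedLines = sampleText.match(notWrapped) || [];
console.log(unwrappedLines);
// [ 'First paragraph.', '<strong>Bold</strong> text' ]
```

Note that this form only recognises a bare opening tag followed by its closing tag on the same line. A line such as <p class="x">...</p>, or an <img> without a closing tag, would still be reported as unwrapped.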

Nixen85 On 02 February 2023 at 17:21

I have now found a working approach. The tags should be wrapped in groups rather than in a character class. The following works for me:

/(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm
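For reference, a minimal sketch of how this pattern might be used to collect the unwrapped paragraphs. The function name findUnwrapped and the sample text are assumptions for the example, not part of the original post.

```javascript
// The grouped pattern from this answer. The lookbehind now looks for a
// complete closing tag such as "</p>" or "</h1>" at the end of the line.
const unwrappedParagraph =
  /(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm;

// Hypothetical helper: return the text of every paragraph (capture group 1)
// that does not end in one of the listed closing tags.
function findUnwrapped(text) {
  return Array.from(text.matchAll(unwrappedParagraph), (m) => m[1]);
}

// Hypothetical sample input: one block per line.
const sample = [
  "<h1>Title</h1>",
  "First paragraph.",
  "<strong>Bold</strong>",
  "<p>Wrapped</p>",
].join("\n");

console.log(findUnwrapped(sample));
// [ 'First paragraph.', '<strong>Bold</strong>' ]
```

Lookbehind assertions were added to JavaScript in ES2018, so this needs a reasonably recent browser or Node.js version. The check only looks at the closing tag at the end of each line, so it assumes each block sits on a single line.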