Before you slate me: yes, I know you shouldn't parse HTML with regex and should use a dedicated parser instead. I don't have that option in the language I'm using (Xojo) and, for various reasons, I need to use RegEx.
I'm trying to capture an entire block of HTML that may or may not contain nested HTML elements. Examples:
<blockquote> This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.
Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
id sem consectetuer libero luctus adipiscing.</blockquote>
-----------------
<blockquote> This is the first level of quoting.
<blockquote> This is nested blockquote.</blockquote>
Back to the first level.</blockquote>
-----------------
<div>
Not nested
</div>
-----------------
<div>
Top level
<div>Nested</div>
</div>
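I came up with this pattern: <(\w*)>([\S\s]*?)<\/\1>
Whilst it works for flat blocks of HTML, it fails if the block contains a nested element with the same tag as the parent block: the lazy match ends at the first closing tag, which belongs to the inner element rather than the outer one. Online example here.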
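To make the failure concrete, here is a minimal sketch in Python rather than Xojo. Python's re module is not PCRE, but this pattern only uses a backreference and a lazy quantifier, which I'm assuming behave the same way in both engines. The helper name first_match and the two sample strings are just for illustration (they mirror the div examples above).

```python
import re

# The pattern from the question: group 1 is the tag name,
# group 2 is the lazily matched body, \1 must repeat the tag name.
PATTERN = re.compile(r"<(\w*)>([\S\s]*?)<\/\1>")

# Sample inputs mirroring the div examples above.
flat = "<div>\nNot nested\n</div>"
nested = "<div>\nTop level\n<div>Nested</div>\n</div>"


def first_match(html):
    """Return the full text of the first match, or None if nothing matches."""
    m = PATTERN.search(html)
    return m.group(0) if m else None


# The flat block is captured in full.
print(repr(first_match(flat)))

# The nested block is cut short: the lazy body stops at the inner </div>,
# so the outer closing tag is left out of the match.
print(repr(first_match(nested)))
```

For the nested input the match ends after the inner closing tag, so what I get back is the outer opening tag plus the inner block, not the whole outer block.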
I'm using the PCRE variant of RegEx and coding in Xojo.
Does anyone have any useful advice on how to solve this problem? Thank you.