I am trying to use regex to match any text except for HTML tags. I have found this solution for "normal" HTML code:
<[^>]*>(*SKIP)(*F)|[^<]+
However, my code is encoded using `&lt;` and `&gt;` instead of `<` and `>`, and I have not been able to modify the regex above to make it work.
As an example, given the text:
Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;
I need to match "Hi" and "there, how are you". Note that I also need to match text that is outside any tag, such as "Hi" in this example.
UPDATE: since I am using Ruby's `gsub`, it looks like I cannot even use `(*SKIP)` and `(*F)`.
UPDATE 2: I was trying not to go into too much detail, but it seems to be important:
I actually need to replace all the spaces in a text, but not those spaces that are part of a tag, be it a `&lt; ... &gt;` tag or a `<...>` tag.
You can use `gsub` with the pattern `(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]` and a `{ $1 || '_' }` block.
I suggest `[[:blank:]]` instead of `\s` since I assume you do not want to replace line breaks. See the Ruby demo. The regex above matches:
- `(&lt;.*?&gt;|<[^>]*>)` - Group 1: either `&lt;`, any zero or more chars as few as possible, and `&gt;`; or `<`, then zero or more chars other than `>`, and then a `>`
- `|` - or
- `[[:blank:]]` - any single horizontal whitespace (you may also use `[\p{Zs}\t]` to match any Unicode horizontal whitespace).

The `{ $1 || '_' }` block in the replacement means that when Group 1 matches, the Group 1 value is returned as is; otherwise, `_` is returned as the replacement for a horizontal whitespace.
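
The pattern and block described above can be put together as a minimal sketch. The method name `replace_blanks_outside_tags`, the variable names, and the sample string (the question's example plus one literal `<b>` tag) are assumptions made for illustration, not part of the original answer.

```ruby
# Replace every horizontal whitespace with "_" unless it sits inside a tag.
# Hypothetical helper name; the pattern follows the explanation above.
def replace_blanks_outside_tags(text)
  # Group 1 captures a whole tag, entity-encoded (&lt;...&gt;) or literal (<...>).
  # The second alternative matches one blank (space or tab) outside any tag.
  pattern = /(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/

  # When Group 1 matched, put the tag back unchanged; otherwise emit "_".
  text.gsub(pattern) { $1 || '_' }
end

# Assumed sample input: the question's text with an extra literal tag appended.
sample = "Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt; <b class=\"x\">bold text</b>"

puts replace_blanks_outside_tags(sample)
```

The spaces inside `&lt;p class="hello"&gt;` and `<b class="x">` are left alone because those tags are consumed by Group 1 before the blank alternative gets a chance to match, and the `\r\n` line breaks are untouched because `[[:blank:]]` does not match them. If an encoded tag could span several lines, adding the `m` flag would let `.` match line breaks as well; whether that is needed depends on the input.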