I'm new to regular expressions.
I'd like to extract all the strings between HTML tags
that contain a given substring.
For example, for the HTML below:
<span class='mouseOverHeader'>Test TEst</span>
<div class='mouseOverData'>
xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>
<a id="email" style="cursor:pointer" onclick=">mmmmmm</a>
</div>
I'd like to extract the string "qqq qqq MYSUBSTRING" because it contains the substring I am looking for, "MYSUBSTRING".
Thanks a lot for your help.
The usual way to parse HTML is by building a tree (with something like BeautifulSoup in Python or HTML::Tree in Perl).
The reason is that, because of the nested nature of HTML tags and the way HTML can embed other languages, regex will often fail or produce wrong output. I believe those modules build the tree by pushing opening tags (read: <> brackets) onto a stack and popping them off whenever they are closed (read: </>).

If, however, you wish to stick with regex, try starting with this:
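The snippet that originally followed is not preserved here, so what follows is only a minimal sketch of the idea in Python, not the original code. The names html, needle, pattern and matches are made up for illustration, and the pattern assumes the wanted text sits directly between a closing > and the next <, with no angle brackets inside it.

```python
import re

# The sample HTML from the question, copied verbatim.
html = """<span class='mouseOverHeader'>Test TEst</span>
<div class='mouseOverData'>
xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>
<a id="email" style="cursor:pointer" onclick=">mmmmmm</a>
</div>"""

needle = "MYSUBSTRING"  # the substring to look for

# One run of text between '>' and '<' that contains the needle.
# [^<>]* keeps the match from spilling over into a neighbouring tag.
pattern = re.compile(r">([^<>]*" + re.escape(needle) + r"[^<>]*)<")

matches = []
pos = 0
while True:
    m = pattern.search(html, pos)  # finds only the next match
    if m is None:
        break                      # no further match: stop looping
    matches.append(m.group(1).strip())
    pos = m.end()                  # resume right after this match

print(matches)  # ['qqq qqq MYSUBSTRING']
```

Text at the very start or end of the input, with no tag next to it, would be missed by this pattern, which is one more reason to prefer a real parser for anything beyond a quick extraction.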
This regex will only match the first group, but the while loop will make it keep matching until it reaches a substring that does not contain the pattern. I would strongly recommend practicing with an online visual regex tester that highlights the groups as you type (try RegExr).
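For comparison, here is a rough sketch of the parser-based route recommended at the top of this answer. It uses html.parser from Python's standard library as a stand-in for BeautifulSoup so that it runs without installing anything. The class name TextFinder and its hits attribute are made up for illustration, and the HTML is a trimmed copy of the sample from the question (the <a> line is left out).

```python
from html.parser import HTMLParser


class TextFinder(HTMLParser):
    """Collects every text node that contains a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.hits = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        text = data.strip()
        if self.needle in text:
            self.hits.append(text)


# Trimmed version of the sample HTML from the question.
html = """<span class='mouseOverHeader'>Test TEst</span>
<div class='mouseOverData'>
xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>
</div>"""

finder = TextFinder("MYSUBSTRING")
finder.feed(html)
finder.close()

print(finder.hits)  # ['qqq qqq MYSUBSTRING']
```

Because the parser does the tokenizing, a substring that happens to appear inside an attribute value is not reported as text.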