Regex: extract the string between HTML <br> tags

I am new to regular expressions. I'd like to extract all the strings between HTML tags that contain a given substring. For example, for the HTML below:

 <span class='mouseOverHeader'>Test TEst</span>
 <div class='mouseOverData'>
 xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>
 <a id="email" style="cursor:pointer" onclick=">mmmmmm</a>
 </div>

I'd like to extract the string "qqq qqq MYSUBSTRING" because it contains the substring I am looking for, "MYSUBSTRING".

Thanks a lot for your help.

There is 1 best solution below

The usual way to parse HTML is to build a tree, using something like BeautifulSoup in Python or HTML::Tree in Perl.

The reason is that HTML tags nest and HTML can embed other languages, so a regex will often fail or produce wrong output. I believe those modules build the tree with a stack: each opening tag (<...>) is pushed onto the stack and popped off again when its closing tag (</...>) is reached.
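To make the stack idea concrete, here is a minimal sketch in Python using the standard library's html.parser module. It is not how BeautifulSoup or HTML::Tree are actually implemented. The names TextFinder, find_text, VOID_TAGS and SAMPLE are made up for this illustration, and SAMPLE is a trimmed copy of the HTML from the question without the <a> element.

```python
from html.parser import HTMLParser

# Void elements such as <br> never get a closing tag, so they are not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Trimmed copy of the question's HTML (assumption: the <a> element is left out).
SAMPLE = """<span class='mouseOverHeader'>Test TEst</span>
<div class='mouseOverData'>
xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>
</div>"""


class TextFinder(HTMLParser):
    """Collect text runs containing `needle`, together with the enclosing tag."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []    # currently open tags, innermost last
        self.matches = []  # (enclosing tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Called once for each run of text between two tags.
        text = data.strip()
        if self.needle in text:
            parent = self.stack[-1] if self.stack else None
            self.matches.append((parent, text))


def find_text(html, needle):
    parser = TextFinder(needle)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    print(find_text(SAMPLE, "MYSUBSTRING"))
```

Because the parser delivers the text between two tags as one unit, the <br> tags act as natural separators and no regex is needed to find the segment.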

If you nevertheless wish to stick with regex, try starting with this:

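# $code is assumed to hold the HTML text.
# The lookahead (?=<br>) leaves the closing <br> unconsumed, so the same
# <br> can also open the next match (otherwise every other segment is skipped).
while ($code =~ m/<br>(.+?)(?=<br>)/g)
{
    print "$1\n";
}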

Each evaluation of the match finds only one occurrence, but with the /g modifier the while loop resumes where the previous match ended and stops once no further match is found. I would strongly recommend practicing with an online visual regex tester that highlights the groups as you type (try RegExr).
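If the goal is only the segment that contains a known substring, a regex can target it directly instead of looping over every <br> pair. The following is a minimal Python sketch of that idea. The function name segments_containing is made up for this illustration, and it assumes the substring never appears inside a tag or attribute value, where it would also be matched.

```python
import re


def segments_containing(html, needle):
    """Return the tag-free text runs in `html` that contain `needle`."""
    # [^<>]* cannot cross a tag boundary, so each hit is the text between
    # two neighbouring tags (for example between two <br> tags).
    pattern = re.compile(r"[^<>]*" + re.escape(needle) + r"[^<>]*")
    return [m.group(0).strip() for m in pattern.finditer(html)]


if __name__ == "__main__":
    line = "xxx cccc ccccc<br>qqq wwww wwww<br>qqq qqq MYSUBSTRING<br><br>"
    print(segments_containing(line, "MYSUBSTRING"))
```

re.escape keeps any regex metacharacters in the substring from being interpreted as pattern syntax.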