I have the following HTML to parse:
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>
<h1 class="x1">test2</h1>
<p>some text </p>
<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
Can I parse this into an array with a single regular expression?
I tried
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism',$html,$arr);
which gives me only one entry, because the last part of the regex is greedy, and
preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism',$html,$arr);
which gives me none of the HTML between the <h1> elements, because the expression is not greedy.
How can I make the part after the </h1> match greedily, while at the same time matching as many occurrences as possible?
Additional comments:
- The question is fairly academic: I have solved the problem using preg_split, and a variety of other methods would work but may have downsides of their own (for example, DOM may not work on invalid HTML that I cannot control). However, it is a recurring problem that I'd be interested to know more about.
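For illustration, the preg_split route mentioned above could look roughly like the following sketch. The exact code used is not shown in the question, so the pattern, the flags, and the shortened sample input are assumptions:

```php
<?php
// Shortened sample input (assumption: same shape as the HTML in the question).
$html = '<h1 class="x">test</h1><p>some text <img src="x" /></p>'
      . '<h1 class="x1">test2</h1><p>some text </p>';

// Split on every complete <h1>...</h1> element.
// PREG_SPLIT_DELIM_CAPTURE keeps the captured heading text in the result;
// PREG_SPLIT_NO_EMPTY drops the empty piece in front of the first heading.
$parts = preg_split(
    '#<h1[^>]*>(.*?)</h1>#is',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Group the flat list into [heading, following HTML] pairs.
$pairs = array_chunk($parts, 2);

print_r($pairs);
```

Note that PREG_SPLIT_NO_EMPTY would also drop an empty heading or an empty body, which would throw the pairing off, so this only holds when every heading has text and is followed by some content.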
You need some form of end marker. The regex cannot guess up to which point you want to match.
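To make the problem concrete, here is a small sketch that runs both patterns from the question against the sample HTML. The variable names ($greedyCount, $lazyCount and so on) are only chosen for this illustration:

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$html = <<<'HTML'
<h1 class="x">test</h1>
<p>some text <img src="x" /></p>
<h1 class="x1">test2</h1>
<p>some text </p>
<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>
HTML;

// Greedy tail: (.*) runs to the end of the input (the s flag lets . match
// newlines), so the first match swallows the remaining headings.
$greedyCount = preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism', $html, $greedy);

// Lazy tail: nothing follows (.*?), so the shortest possible match,
// the empty string, already satisfies the pattern.
$lazyCount = preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism', $html, $lazy);

var_dump($greedyCount); // int(1)
var_dump($lazyCount);   // int(3)
var_dump($lazy[4]);     // three empty strings
```

In both cases the last group has nothing after it that tells the engine where to stop, which is why an explicit marker is needed.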
One possibility in this case is a lookahead assertion after the (.*?) at the end: