preg_match_all - greedy part of regex, but maximise number of matches

952 Views Asked by At

I have the following html to parse:

<h1 class="x">test</h1>
<p>some text <img src="x" /></p>

<h1 class="x1">test2</h1>
<p>some text </p>

<h1 class="2">test3</h1>
<p>some text <img src="x" /></p>

Can I parse this into an array with a single regular expression?

I tried

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*)#ism',$html,$arr);

which gives me only one entry, because the last part of the regex is greedy, and

preg_match_all('#(<h1[^>]*?>)(.*?)(</h1>)(.*?)#ism',$html,$arr);

which gives me nothing of the HTML between the <h1>, because the expression is not greedy.

How can I make the part after the be matched greedy, while at the same time matching as many occurences as possible?

Additional comments:

  • the question is fairly academical, I have resolved the problem using pre_split and a variety of other methods would work, but may also have downsides (for example DOM may not work on invalid HTML that I cannot control). However it is a recurring problem that I'd be interested to know more about.
2

There are 2 best solutions below

2
On BEST ANSWER

You need some form of end maker. The regex can not guess until which part you want to match.

Possible in this case might be a lookahead assertion after the (.*?) at the end:

(?=<h1|</body>|\z)#ims
0
On

Ignoring the comments about how regex is unsuitable, because it's still an interesting problem, there are two ways to approach this: Greedy and lazy.

The respective parts of the pattern are:

  • Lazy: .*?(?=<h1|\z)
  • Greedy: (?:[^<]+|<(?!h1))*

You may be familiar with the performance of greedy vs. lazy qualifiers in general, but the crux here is much simpler.

If the string you're trying to match consists entirely of the character <, then the lazy and greedy patterns will perform about the same, because they both have to check the assertion for every matched character.

However in HTML you have far more other characters than < characters so the greedy pattern, which doesn't have to check the other characters, can be orders of magnitude faster.

I admit that the lazy pattern is easier to read, but I think that the vastly better performance is worth it and would highly recommend commenting your patterns with the x modifier anyway.