I'm using strip_tags to make sure every HTML tag is removed before saving a string.
The problem is that a single < without any closing tag is removed as well.
The idea is now to replace every single < with the matching HTML entity &lt;.
I have a regex for this, but it only replaces the first match. Any idea how I can adjust it?
This is the regex I have so far: preg_replace("/<([^>]*(<|$))/", "&lt;$1", $string);
I want this:
<p> Hello < 30 </p> < < < <!-- Test --> <> > > >
to become first this with preg_replace(REGEX, REPLACE, $string):
<p> Hello &lt; 30 </p> &lt; &lt; &lt; <!-- Test --> &lt;> > > >
and then this after strip_tags($string):
Hello &lt; 30 &lt; &lt; &lt; &lt;> > > >
Any idea how I can achieve that?
Maybe you even know a better way.
Your question is interesting, so I took the time to try to solve it. I think the only way is to do it in several steps:
1. Remove the HTML comments.
2. Match all HTML tags with a regular expression and rewrite them into another form, replacing the < and > characters with something else, such as [[ and ]] respectively.
3. Replace the remaining < with &lt; and > with &gt;.
4. Turn [[tag attr="value"]] and [[/tag]] back into the original HTML tags <tag attr="value"> and </tag>.
5. Strip the HTML tags we want with strip_tags(), or with a safer and more flexible library such as HTMLPurifier.
The PHP code
Sorry, the syntax highlighting seems to be buggy because I used Nowdoc strings for ease of editing:
You can run it here: https://onlinephp.io/c/005a3
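As a rough illustration of the steps listed above, here is a minimal sketch. It is not the code behind the link: the function name escapeLoneAngleBrackets, the simplified tag pattern, and the assumption that the input never contains a literal [[ or ]] are all mine.

```php
<?php
// Sketch only: the function name, the [[ ]] placeholders and the simplified
// tag pattern are assumptions made for illustration.
function escapeLoneAngleBrackets(string $html): string
{
    // Step 1: remove HTML comments.
    $html = preg_replace('~<!--.*?-->~s', '', $html);

    // Step 2: rewrite everything that looks like a real tag as [[...]].
    $tagPattern = '~
        <
        (?<tag>
            /? [a-zA-Z][a-zA-Z0-9-]*      # tag name, optional leading slash
            (?:
                \s+ [^\s"\'<>/=]+         # attribute name
                (?:
                    \s* = \s*
                    (?:
                        (?<quote>["\']) .*? \k<quote>   # quoted value, may contain < or >
                      | [^\s"\'<>=`]+                   # unquoted value
                    )
                )?
            )*
            \s* /?
        )
        >
    ~xs';
    $html = preg_replace_callback(
        $tagPattern,
        fn (array $m): string => '[[' . $m['tag'] . ']]',
        $html
    );

    // Step 3: any < or > still present is not part of a tag, so escape it.
    // Note: < and > inside quoted attribute values get escaped as well.
    $html = str_replace(['<', '>'], ['&lt;', '&gt;'], $html);

    // Step 4: turn the placeholders back into real tags.
    return preg_replace('~\[\[(.*?)\]\]~s', '<$1>', $html);
}

$input = '<p> Hello < 30 </p> < < < <!-- Test --> <> > > >';
$escaped = escapeLoneAngleBrackets($input);

echo $escaped, PHP_EOL;             // tags kept, lone brackets escaped
echo strip_tags($escaped), PHP_EOL; // tags removed, entities preserved
```

The placeholder trick is fragile if the text itself can contain [[ or ]], so a less common marker may be a safer choice in practice.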
For the regular expression, I used ~ instead of the usual / to delimit the pattern and the flags. This is just so that we can use the slash in the pattern without escaping it.
I also used the x flag for the extended notation, so that I can put comments in my pattern and write it over several lines.
For readability and flexibility, I also used named capturing groups, such as (?<quote>), so that we don't rely on indexes, which could shift if we add other capturing groups. A backreference is then written \k<quote> instead of the indexed version \4.
HTML5 seems quite permissive: apparently the > character can be put in an attribute value without replacing it with &gt;. I suppose this wasn't allowed in the past and became accepted to help users write readable pattern attributes on <input> fields. I added an example of a password field where the < and > characters are not allowed. This shows how to handle it in the regular expression, by accepting an attribute with a single- or double-quoted value.
The output:
As you can see, strip_tags() doesn't handle spaces around the tag name, which I find completely unsafe! This is why I would suggest using a library such as HTMLPurifier or a DOM parser.
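To make the point about spaces concrete, here is a small check you could run yourself. The sample strings are made up for illustration, and the behaviour shown is what I would expect from strip_tags() when no allowed tags are passed, so it is worth verifying on your own PHP version.

```php
<?php
// With whitespace right after "<", strip_tags() keeps the text as plain content.
$spaced = '< b >bold< /b >';
var_dump(strip_tags($spaced) === $spaced);

// Whitespace only before ">" does not prevent the tag from being stripped.
var_dump(strip_tags('<b >bold</b >') === 'bold');
```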