So far, I can only keep one attribute, but I am trying to keep both the class and id attributes in the HTML tags.
Code:
$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id attrtibutes.</div>';
echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $string);
Output:
<div class="someClassName">Some text <a class="classLink">link</a> with only the class and id attrtibutes.</div>
I am trying to remove all other attributes from every tag except the class and id attributes.
Using DOMDocument adds extra p tags to the output for some reason, and I believe XPath is faster?
Iterate over all nodes in the DOM, then loop over each node's attributes in reverse so that you can safely prune attributes that are not in your whitelist.
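The approach described above could look roughly like the following. This is a minimal sketch rather than the original demo code: the function name keepAttributes and the $whitelist parameter are hypothetical, and passing LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD is an assumption about how to stop DOMDocument from adding wrapper markup. Those flags tend to behave best when the fragment has a single root element, as the sample string does.

```php
<?php
// Hypothetical helper: strip every attribute that is not in the whitelist.
function keepAttributes(string $html, array $whitelist = ['class', 'id']): string
{
    $dom = new DOMDocument();
    // Assumption: these flags suppress the implied <html>/<body> wrappers and the doctype.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*') as $node) {
        // Walk the attributes backwards: removing one shifts the indexes after it.
        for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
            $name = $node->attributes->item($i)->nodeName;
            if (!in_array($name, $whitelist, true)) {
                $node->removeAttribute($name);
            }
        }
    }

    // saveHTML() appends a trailing newline, so trim it.
    return trim($dom->saveHTML());
}

$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id attrtibutes.</div>';
echo keepAttributes($string);
```

For the sample string, this should leave only the id and class attributes on the div and a tags, in their original order.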
Code: (Demo)
Output:
...actually, XPath isn't really needed because we are iterating every node in the dom. (Demo)
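A sketch of the same idea without XPath, assuming getElementsByTagName('*') is an acceptable way to visit every element. As before, the function name keepAttributesNoXPath and the $whitelist parameter are hypothetical, and the libxml flags are an assumption that suits fragments with a single root element.

```php
<?php
// Hypothetical helper: same pruning logic, but without DOMXPath.
function keepAttributesNoXPath(string $html, array $whitelist = ['class', 'id']): string
{
    $dom = new DOMDocument();
    // Assumption: these flags suppress the implied <html>/<body> wrappers and the doctype.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // '*' matches every element in the document.
    foreach ($dom->getElementsByTagName('*') as $node) {
        // Reverse order keeps the remaining indexes valid while removing.
        for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
            $name = $node->attributes->item($i)->nodeName;
            if (!in_array($name, $whitelist, true)) {
                $node->removeAttribute($name);
            }
        }
    }

    // saveHTML() appends a trailing newline, so trim it.
    return trim($dom->saveHTML());
}

$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id attrtibutes.</div>';
echo keepAttributesNoXPath($string);
```

Only attributes are removed, never elements, so the node list being iterated does not change during the loop.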
Trying to parse valid HTML with a regular expression is going to be one or more of the following:
Regex does not know the difference between tags and text that merely looks like tags. What if the HTML tags and attributes use upper and lower case? What if single quotes, double quotes and/or backticks are used? What if an attribute has no assignment (e.g. readonly or checked)? What if a data- attribute name ends with id or title? What if an attribute value contains a quoting symbol which is escaped instead of html encoded? What if text looks like a starting tag, but isn't a tag at all? These are valid reasons to steer clear of parsing valid HTML with regex.
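To make a couple of these pitfalls concrete, here is a small sketch that runs the pattern from the question against two inputs. The inputs are hypothetical and were chosen only for illustration: one uses an unquoted attribute value, the other has a > character inside a quoted attribute value.

```php
<?php
// The pattern from the question, copied verbatim.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// Hypothetical inputs, chosen only to illustrate the pitfalls.
$cases = [
    '<div class=foo>x</div>',              // unquoted attribute value
    '<a title="a > b" class="x">link</a>', // ">" inside a quoted value
];

foreach ($cases as $html) {
    echo preg_replace($pattern, '<$1$2$3>', $html), PHP_EOL;
}
// <div>x</div>
// <a> b" class="x">link</a>
```

In the first case the class attribute is silently dropped because the pattern insists on quotes. In the second, the pattern treats the > inside the title value as the end of the tag, so the rest of the tag leaks into the text.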