I have the following bit of HTML:
<div class="article">this is a div article content</div>
which is being "tagged" by an HTML-agnostic program on the words div, class and article, resulting in:
<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</<hl>div</hl>>
although what I really need is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
Since the output is utter garbage (even tools like HTML Tidy choke on it), I figured a regex replace would help strip out the extra <hl>s inside the HTML tag:
replace(/<([^>]*)<hl>([^<]*?)<\/hl>([^>]*?)>/g, '<$1$2$3>')
Now, this works, but it only replaces the first occurrence in the tag, that is, the div:
<div <hl>class</hl>="<hl>article</hl>">this is a <hl>div</hl> <hl>article</hl> content</div>
My question is: how do I replace all <hl>s inside the tag, so as to make sure the HTML remains valid?
Additional notes:
- I don't need the tag attributes at all (i.e. class="article" can disappear)
- I can change <hl> and </hl> for any other strings
- Yes, the output comes from Solr
UPDATE: I accepted jcollado's answer, but I needed this in JavaScript. This is the equivalent code:
var stripIllegalTags = function(html) {
    var output = '',
        parsingTag = false;
    for (var i = 0; i < html.length; i++) {
        var character = html[i];
        if (character === '<') {
            if (parsingTag) {
                // A '<' inside an open tag starts an injected tag: skip to its '>'.
                // The length check stops the scan if that '>' is missing.
                do {
                    i++;
                } while (i < html.length && html[i] !== '>');
                continue;
            }
            parsingTag = true;
        } else if (character === '>') {
            parsingTag = false;
        }
        output += character;
    }
    return output;
};
Maybe the piece of code below is helpful for you:
The output for the given input is:
<div class="article">this is a <hl>div</hl> <hl>article</hl> content</div>
which I believe is what you're looking for.
The code basically drops any tag that starts while another tag is still being parsed, that is, any < seen before the enclosing tag's closing > has been reached.
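For comparison, the regex from the question can probably be kept as well: each pass removes only one <hl> pair per tag, so re-applying it until the string stops changing should clear the rest. The following is a minimal sketch under that assumption, not a general HTML fixer; stripHlInsideTags is a hypothetical name, and it assumes the highlighted text never contains a stray < or > of its own.

```javascript
// Hypothetical helper (not from the original post): re-applies the
// question's regex until a pass leaves the string unchanged.
function stripHlInsideTags(html) {
  var pattern = /<([^>]*)<hl>([^<]*?)<\/hl>([^>]*?)>/g;
  var previous;
  do {
    previous = html;
    // Each pass strips at most one <hl>...</hl> pair from every tag.
    html = html.replace(pattern, '<$1$2$3>');
  } while (html !== previous);
  return html;
}

// The tagged markup from the question.
var tagged = '<<hl>div</hl> <hl>class</hl>="<hl>article</hl>">this is a ' +
    '<hl>div</hl> <hl>article</hl> content</<hl>div</hl>>';
console.log(stripHlInsideTags(tagged));
```

Every pass that changes anything makes the string shorter, so the loop terminates. The character-by-character version is likely the better choice for long inputs, since it scans the string only once.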