Replace a word even if empty HTML tags split it up

So this is a rather odd question, I know that. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive. I have yet to see a single error in all the PDFs I have converted to HTML.

With this HTML, I need to replace some strings dynamically with C#. However, I can't simply say line.Replace("#SOME_STRING", "Another string"), although I wrote #SOME_STRING in the document before exporting to PDF. Why not, you might ask? Because the output of pdf2htmlEX can look something like this:

<div class="t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0">#SOME_ST<span class="_ _5"></span>RING </div>

See that empty span-tag with a _ and _5 class? Yep, that prevents me from replacing my word. The _5 class simply has some width (like width: 0.9889px).

In this case, how would I replace #SOME_ST<span class="_ _5"></span>RING with something else?

Here are some cases:

(#SOME_STRING)          #SOME_ST<span class="_ _5"></span>RING
(#SOME_OTHER_STRING)    #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING

I'm kind of lost here: I can't just remove all the _5 elements, because the class names are randomized every time I change something in the document.

EDIT: So I basically need a way to filter out the HTML tags from my own Key-Value pair, so I can replace the words like #SOME_STRING -> SOMETHING_ELSE.

There are 1 best solutions below

Try using regex to filter all empty spans:

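using System.Text.RegularExpressions;

// Matches any <span> that has no content, whatever its attributes are.
var myRegex = new Regex(@"(?<emptyspan><span[^>]*></span>)", RegexOptions.None);
var strTargetString = @"<div class=""t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0"">#SOME_ST<span class=""_ _5""></span>RING </div> <span></span>";

// Remove every empty span so the placeholder becomes contiguous again.
// Note: this also drops empty spans used for spacing elsewhere in the string.
var cleaned = myRegex.Replace(strTargetString, string.Empty);

// Now a plain string replacement finds the placeholder.
var result = cleaned.Replace("#SOME_STRING", "Another string");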
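
If you would rather leave the other spacing spans in the document alone, one alternative is to build the pattern from the key itself and allow empty spans between any two of its characters. This is only a sketch, not something pdf2htmlEX provides: the `PlaceholderReplacer` class and its method names are made up for illustration, and it assumes the tool only ever inserts empty `<span ...></span>` tags inside a word, as in the examples from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class PlaceholderReplacer
{
    // Zero or more empty <span ...></span> tags, allowed between any two characters of the key.
    private const string EmptySpans = @"(?:<span[^>]*></span>)*";

    // Turns a literal key such as "#SOME_STRING" into a pattern that tolerates
    // empty spans between its characters. Each character is escaped on its own.
    public static Regex BuildPattern(string key)
    {
        var escapedChars = key.Select(c => Regex.Escape(c.ToString()));
        return new Regex(string.Join(EmptySpans, escapedChars));
    }

    // Replaces every occurrence of the (possibly split) key with value.
    // A MatchEvaluator is used so that '$' in value is inserted literally.
    public static string Replace(string html, string key, string value)
    {
        return BuildPattern(key).Replace(html, _ => value);
    }
}
```

Calling `PlaceholderReplacer.Replace(html, "#SOME_OTHER_STRING", "SOMETHING_ELSE")` on the second case from the question replaces the whole `#SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING` run, including the spans inside it, and leaves every other span untouched. If one key is a prefix of another, replace the longer key first.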