Replace word even if it has empty HTML tags between it, which breaks it up

Question


224 Views Asked by MortenMoulder At 05 April 2018 at 10:55

So this is a rather odd question, I know. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive: I have yet to see a single error in any of the PDFs I have converted to HTML.

With this HTML, I need to replace some strings dynamically with C#. However, I can't simply say line.Replace("#SOME_STRING", "Another string"), although I wrote #SOME_STRING in the document before exporting to PDF. Why not, you might ask? Because the output of pdf2htmlEX can look something like this:

<div class="t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0">#SOME_ST<span class="_ _5"></span>RING </div>

See that empty span-tag with a _ and _5 class? Yep, that prevents me from replacing my word. The _5 class simply has some width (like width: 0.9889px).

In this case, how would I replace #SOME_ST<span class="_ _5"></span>RING with something else?

Here are some cases:

(#SOME_STRING)          #SOME_ST<span class="_ _5"></span>RING
(#SOME_OTHER_STRING)    #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING

I'm kind of lost here. I can't just remove all the _5 elements, because the class is randomized every time I change something in the document.

EDIT: So I basically need a way to filter out the HTML tags from my own Key-Value pair, so I can replace the words like #SOME_STRING -> SOMETHING_ELSE.


There is 1 solution below

**richej** · Answer 1 · 2018-04-05T11:06:22.053000

Try using regex to filter all empty spans:

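using System.Text.RegularExpressions;

// Matches any span with no content, whatever its attributes.
var myRegex = new Regex(@"(?<emptyspan><span[^>]*></span>)", RegexOptions.None);
var strTargetString = @"<div class=""t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0"">#SOME_ST<span class=""_ _5""></span>RING </div> <span></span>";

// Remove every empty span so the placeholder is contiguous again,
// then do the plain string replacement from the question.
var cleaned = myRegex.Replace(strTargetString, string.Empty);
var result = cleaned.Replace("#SOME_STRING", "Another string");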
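
One caveat: the regex above targets every empty span, including the ones pdf2htmlEX uses as spacers elsewhere on the page, so stripping them all may shift the layout. A possible alternative, sketched below, is to build one pattern per key that tolerates empty spans between any two characters of the key, so only the spans inside a placeholder disappear. The names PlaceholderReplacer, BuildPattern and ReplaceAll are made up for this example, and it assumes the replacement values are already safe to insert as HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of pdf2htmlEX or the original answer.
public static class PlaceholderReplacer
{
    // Zero or more empty spans, e.g. <span class="_ _5"></span>.
    private const string EmptySpans = @"(?:<span[^>]*></span>)*";

    // Builds a pattern that matches the key even when empty spans
    // sit between any two of its characters.
    public static Regex BuildPattern(string key)
    {
        var escapedChars = key.Select(c => Regex.Escape(c.ToString()));
        return new Regex(string.Join(EmptySpans, escapedChars));
    }

    public static string ReplaceAll(string html, IDictionary<string, string> replacements)
    {
        // Longest keys first, so a short key cannot match inside a longer one
        // (e.g. #SOME_STRING inside #SOME_STRING_2).
        foreach (var pair in replacements.OrderByDescending(p => p.Key.Length))
        {
            var value = pair.Value;
            // A MatchEvaluator inserts the value literally, so a '$' in it
            // is not interpreted as a substitution token.
            html = BuildPattern(pair.Key).Replace(html, _ => value);
        }
        return html;
    }
}
```

With the second case from the question, calling PlaceholderReplacer.ReplaceAll on #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING with the pair #SOME_OTHER_STRING -> SOMETHING_ELSE should give SOMETHING_ELSE, while empty spans outside the placeholder stay where they are. If the replacement is much longer or shorter than the placeholder, the absolutely positioned text that pdf2htmlEX generates may still need a visual check.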
