So this is a rather odd question, I know. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive: I have yet to see a single error in any of the PDFs I have converted to HTML.
With this HTML, I need to replace some strings dynamically with C#. However, I can't simply say line.Replace("#SOME_STRING", "Another string"), even though I wrote #SOME_STRING in the document before exporting to PDF. Why not, you might ask? Because the output of pdf2htmlEX can look something like this:
<div class="t m0 x5 h5 ya ff4 fs3 fc0 sc0 ls0 ws0">#SOME_ST<span class="_ _5"></span>RING </div>
See that empty span tag with the _ and _5 classes? Yep, that prevents me from replacing my word. The _5 class simply has some width (like width: 0.9889px).
In this case, how would I replace #SOME_ST<span class="_ _5"></span>RING with something else?
Here are some cases:
(#SOME_STRING) #SOME_ST<span class="_ _5"></span>RING
(#SOME_OTHER_STRING) #SOME_<span class="_ _7"></span>OTHER_ST<span class="_ _5"></span>RING
I'm kind of lost here: I can't just remove all the _5 elements, because the class name is randomized every time I change something in the document.
EDIT: So I basically need a way to filter the HTML tags out of the text before matching my own key-value pairs, so I can replace words like #SOME_STRING -> SOMETHING_ELSE.
Try using regex to filter all empty spans:
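A minimal sketch of that idea, assuming the spans pdf2htmlEX inserts are always empty (as in the samples from the question) and that dropping them is acceptable. The class name PlaceholderReplacer and the Replace method are made up for illustration; the pattern matches any empty span regardless of its class, so the randomized class names don't matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of pdf2htmlEX or any library.
public static class PlaceholderReplacer
{
    // Matches a span with any attributes and no content,
    // e.g. <span class="_ _5"></span>. The class value is never inspected.
    private static readonly Regex EmptySpan =
        new Regex(@"<span\b[^>]*>\s*</span>", RegexOptions.IgnoreCase);

    public static string Replace(string html, IDictionary<string, string> replacements)
    {
        // Step 1: drop the empty spans so split placeholders become contiguous again.
        string cleaned = EmptySpan.Replace(html, string.Empty);

        // Step 2: plain string replacement, longest keys first so that a key
        // which is a prefix of another key cannot clobber the longer one.
        foreach (var pair in replacements.OrderByDescending(p => p.Key.Length))
        {
            cleaned = cleaned.Replace(pair.Key, pair.Value);
        }
        return cleaned;
    }
}
```

Called with the question's sample line and the pair #SOME_STRING -> SOMETHING_ELSE, this turns #SOME_ST<span class="_ _5"></span>RING into SOMETHING_ELSE. Keep in mind that it removes every empty span in the input, not only the ones inside placeholders, so the small widths those spans carried are lost everywhere; if that matters, run it only on the lines that contain a #.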