Applying redactions in the form of string substitutions to HTML documents using XSLT

I have a large number of HTML (and possibly other xml) documents that I need to redact.

The redactions are typically of the form "John Doe" -> "[Person A]". The text to be redacted may be in headers or paragraphs, but will almost always be in paragraphs.

Simple string substitutions really. Not very complicated things.

However, I do want to preserve document structure, and I would prefer to not reinvent any wheels. String substitution in the document text may do the job, but also may break document structure, so it will be a last option.

Right now I have stared at XSLT for an hour and tried to force "str:replace" to do my bidding. I will spare you from viewing my feeble attempts that didn't work, but I will ask this: Is there a simple and known way to apply my redactions using XSLT, and could you post it here?

Thank you in advance.

Update: at the request of Martin Honnen I'm adding my input files, as well as the command I used to get the latest error message. From this it will be apparent that I'm a complete n00b when it comes to XSLT :-)

.html file:


    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
      <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
        <title>TodaysDate</title>
        <meta name="created" content="2020-11-04T30:45:00"/>
      </head>
      <body>
        <ol start="2">
          <li><p> John Doe on 9. fux 2057 together with Henry
          Fluebottom formed the company Doe &amp; Fluebottom Widgets
          Inc. </p>
        </ol>
      </body>
    </html>

The XSLT transformation file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        >
<xsl:template match="p">
  <xsl:copy>
<xsl:attribute name="matchesPattern">
  <xsl:copy-of select='str:replace("John Doe", ".*",  "[Person A]")'/>
</xsl:attribute>
  <xsl:copy-of select='str:replace("Henry Fluebottom", ".*",  "[Person B]")'/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

The command and the output:

$  xsltproc -html transform.xsl example.html
xmlXPathCompOpEval: function replace bound to undefined prefix str
xmlXPathCompiledEval: 2 objects left on the stack.
<?xml version="1.0"?>



    TodaysDate




      <p matchesPattern=""/>  

$ 

There are 3 best solutions below

Best answer

xsltproc is based on libxslt, which supports various EXSLT functions such as str:replace. To use it, you need to declare the EXSLT strings namespace:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="str"
    version="1.0">

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p//text()">
        <xsl:value-of select="str:replace(., 'John Doe', '[Person A]')"/>
    </xsl:template>

</xsl:stylesheet>
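
The question asks for two redactions, and the stylesheet above handles only one. A minimal sketch of one way to extend it, assuming nested str:replace calls behave in xsltproc as they do for a single call (I have not run this against the question's file). The normalize-space() call is my addition: it collapses the line break inside "Henry Fluebottom" in the sample input, at the cost of also trimming leading and trailing whitespace from each text node.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="str"
    version="1.0">

    <!-- identity transform: copy everything that is not matched below -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- redact both names in every text node inside a p element;
         the inner call runs first, the outer call works on its result -->
    <xsl:template match="p//text()">
        <xsl:value-of select="str:replace(
            str:replace(normalize-space(.), 'John Doe', '[Person A]'),
            'Henry Fluebottom', '[Person B]')"/>
    </xsl:template>

</xsl:stylesheet>
```

It can be run with the same command as in the question (xsltproc -html transform.xsl example.html). Nesting gets unwieldy beyond a handful of names, so for a long redaction list a dictionary-driven approach is probably the better fit.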
Answer

The first problem is to find an XSLT processor that actually supports string replacement. The replace() function is standard in XSLT 2.0+, but does not exist in XSLT 1.0. Some XSLT 1.0 processors support an extension function str:replace() in a different namespace, but at the very least, you need to add the namespace declaration xmlns:str="http://exslt.org/strings" to your stylesheet in order to locate the function. I don't know if that will work (I don't know if there is any way of using this function with xsltproc); my advice would be to use an XSLT 2.0+ processor instead.

The next problem is the way you are invoking the function. Typically, a correct invocation would be

replace(., "John Doe", "[Person A]")

though you will have to jump through a few more hoops to make multiple replacements on the same string.
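
For a small number of names, the simplest form of those hoops is to nest the calls. A minimal sketch, assuming an XSLT 2.0+ processor (Saxon is one example; xsltproc will not run this). Keep in mind that the second argument of replace() is a regular expression, so names containing characters such as "." or "(" would need escaping, and "$" or "\" in the replacement text would too.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- identity transform: copy everything that is not matched below -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- apply both redactions to every text node;
         the inner replace() runs first, the outer one works on its result -->
    <xsl:template match="text()">
        <xsl:value-of select="replace(
            replace(., 'John Doe', '[Person A]'),
            'Henry Fluebottom', '[Person B]')"/>
    </xsl:template>

</xsl:stylesheet>
```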

I've no idea what you are trying to achieve with the <xsl:attribute name="matchesPattern"> instruction.

Answer

There is no simple way in XSLT 1.0 to perform multiple replacements on the same string. You need to use a recursive named template, performing one replacement operation at a time, then moving to the next instance of the current find string or - when no next instance exists - to the next find/replace pair.

Consider the following example:

Input

<html>
    <head>
        <title>John Doe and Henry Fluebottom</title>
    </head>
    <body>
        <p>John Doe is a person. John Doe on 9. fux 2057 together with Henry Fluebottom formed the company Doe &amp; Fluebottom Widgets Inc. Henry Fluebottom is also a person.</p>
    </body>
</html>

XSLT 1.0 (+ EXSLT node-set() function)

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" omit-xml-declaration="yes" version="1.0" encoding="utf-8" indent="yes"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:variable name="dictionary">
    <entry find="John Doe" replace="[Person A]"/>
    <entry find="Henry Fluebottom" replace="[Person B]"/>
</xsl:variable>

<xsl:template match="text()">
    <xsl:call-template name="multi-replace">
        <xsl:with-param name="string" select="."/>
        <xsl:with-param name="entries" select="exsl:node-set($dictionary)/entry"/>
    </xsl:call-template>
</xsl:template>

<xsl:template name="multi-replace">
    <xsl:param name="string"/>
    <xsl:param name="entries"/>
    <xsl:choose>
        <xsl:when test="$entries">
            <xsl:call-template name="multi-replace">
                <xsl:with-param name="string">
                    <xsl:call-template name="replace">
                        <xsl:with-param name="string" select="$string"/>
                        <xsl:with-param name="search-string" select="$entries[1]/@find"/>
                        <xsl:with-param name="replace-string" select="$entries[1]/@replace"/>
                    </xsl:call-template>
                </xsl:with-param>
                <xsl:with-param name="entries" select="$entries[position() > 1]"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$string"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

<xsl:template name="replace">
    <xsl:param name="string"/>
    <xsl:param name="search-string"/>
    <xsl:param name="replace-string"/>
    <xsl:choose>
        <xsl:when test="contains($string, $search-string)">
            <xsl:value-of select="substring-before($string, $search-string)"/>
            <xsl:value-of select="$replace-string"/>
            <xsl:call-template name="replace">
                <xsl:with-param name="string" select="substring-after($string, $search-string)"/>
                <xsl:with-param name="search-string" select="$search-string"/>
                <xsl:with-param name="replace-string" select="$replace-string"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$string"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

Result

<html>
    <head>
        <title>[Person A] and [Person B]</title>
    </head>
    <body>
        <p>[Person A] is a person. [Person A] on 9. fux 2057 together with [Person B] formed the company Doe &amp; Fluebottom Widgets Inc. [Person B] is also a person.</p>
    </body>
</html>

As you can see, this replaces all instances of the search strings anywhere in the input document (except for attributes), while preserving the document's structure.


Note that the input in your example does not actually contain the "Henry Fluebottom" search string, because the name is split across a line break. You might want to get around that by calling the first template with:

<xsl:with-param name="string" select="normalize-space(.)"/>

instead of:

<xsl:with-param name="string" select="."/>