XSLT replace substring with element while preserving other inline-elements

I have an XML document with text and another with a list of words. I want to find the words from the list in the text and wrap them in a new element while leaving everything else as is. In short, the XSLT should manage three things:

  1. Preserve all existing elements and attributes, including inline elements
  2. Identify all words that appear in an external list of words
  3. Replace these words with an element (which ideally references the id included in the external list of words)

I managed to do some of these things, however I have trouble with bringing it all together and creating the desired output.

The input:

<doc>
    <header>Document with example sentences</header>
    <text>
        <div type="sentence" n="1">They<note>buyers</note> bought an apple and a banana.</div>
        <div type="sentence" n="2">They<note>shop</note> only had a strawberry and an apple left.</div>
    </text>
</doc>

The list:

<list>
    <fruit id="001">
        <english>apple</english>
        <translations>Apfel, pomme</translations>
    </fruit>
    <fruit id="002">
        <english>banana</english>
        <translations>Banane, banane</translations>
    </fruit>
    <fruit id="003">
        <english>strawberry</english>
        <translations>Erdbeere, strawberry</translations>
    </fruit>
</list>

The desired output:

<doc>
    <header>Document with example sentences</header>
    <text>
        <div type="sentence" n="1">They<note>buyers</note> bought an <fruit ref="#001">apple</fruit> and a <fruit ref="#002">banana</fruit>.</div>
        <div type="sentence" n="2">They<note>shop</note> only had a <fruit ref="#003">strawberry</fruit> and an <fruit ref="#001">apple</fruit> left.</div>
    </text>
</doc>

I've tried two things so far. The first manages to identify the words from the list in the text, the second one manages to replace words with templates. I can't figure out how to do both at the same time AND preserve all other elements in the document.

Identifying words from list in text

    <xsl:template match="/ | @*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template> 
    
    <xsl:variable name="list" select="document('liste.xml')"/>
    
    <xsl:template match="div">
        <xsl:variable name="text" select="."/>
        
        <xsl:copy>    
            
            <xsl:apply-templates select="@*|node()"/>
        
            <xsl:for-each select="$list/list/fruit">
                <xsl:variable name="english" select="english"/>
                <xsl:if test="contains($text,$english)">
                    <xsl:element name="identified_fruit">
                        <xsl:value-of select="$english"/>
                    </xsl:element>
                </xsl:if>
            </xsl:for-each>
        </xsl:copy>
        
    </xsl:template>

Replacing words with elements:

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@*, node()"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="div">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="text()" mode="wrap">
                <xsl:with-param name="words" as="xs:string+" select="'banana', 'apple', 'strawberry'"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="text()" mode="wrap">
        <xsl:param name="words" as="xs:string+"/>
        <xsl:param name="wrapper-name" as="xs:string" select="'fruit'"/>
        <xsl:analyze-string select="." regex="{string-join($words, '|')}">
            <xsl:matching-substring>
                <xsl:element name="{$wrapper-name}">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

How can I combine the two AND preserve all other elements in the document?

Any advice would be greatly appreciated!

Best, RaBa

Answer by michael.hor257k

It's not exactly clear what can be hard-coded. Perhaps this could work for you:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:param name="list" select="document('liste.xml')"/>

<xsl:key name="fruit" match="fruit" use="english" />

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="div/text()">
    <xsl:analyze-string select="." regex="{string-join($list/list/fruit/english, '|')}">
        <xsl:matching-substring>
            <fruit ref="{key('fruit', ., $list)/@id}">
                <xsl:value-of select="."/>
            </fruit>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

Note that this looks for patterns, not words: the regex matches any substring, with no check for word boundaries. If the input contains "green applesauce", it will be returned as green <fruit ref="001">apple</fruit>sauce.
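
If whole-word matching is needed, one possible variation (a sketch only, not tested against your real data) is to split each text node into runs of word characters and look each run up in the list, instead of building a regex from the list. XPath 2.0 regular expressions have no \b word-boundary anchor, which is why the sketch tokenizes with \w+ rather than adding boundaries to the pattern. Assumptions on my part: a "word" is whatever \w+ matches, the variable name hit is only illustrative, and the "#" prefix on ref is taken from the desired output in the question. Entries containing spaces or hyphens would not match this way, and the lookup is case-sensitive.

```xml
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:param name="list" select="document('liste.xml')"/>

<xsl:key name="fruit" match="fruit" use="english" />

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="div/text()">
    <!-- split the text into runs of word characters and everything else -->
    <xsl:analyze-string select="." regex="\w+">
        <xsl:matching-substring>
            <!-- look the complete word up in the external list (illustrative variable name) -->
            <xsl:variable name="hit" select="key('fruit', ., $list)"/>
            <xsl:choose>
                <xsl:when test="$hit">
                    <!-- literal '#' before the id, as in the desired output -->
                    <fruit ref="#{$hit[1]/@id}">
                        <xsl:value-of select="."/>
                    </fruit>
                </xsl:when>
                <xsl:otherwise>
                    <!-- not in the list: copy the word unchanged -->
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <!-- spaces and punctuation pass through as is -->
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
```

With this version "applesauce" is looked up as a single word, finds no entry in the list and is left alone. The trade-off is that inflected forms such as "apples" are no longer matched either, so the list would have to contain every form you want tagged.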