Automatically add an attribute and its values based on the Latinized characters inside an element

I'm using Oxygen XML Editor 23.1. I'm working on a large corpus of texts and would like to use a transformation to automatically add certain attributes and values to certain elements. In this case, I have a @correspUnic attribute, created to hold Ugaritic glyphs as Unicode decimal character references. The values of @correspUnic depend on the Latinized characters inside the elements. Here is an example of the TEI encoding:

<w>bn</w>
<g>.</g>
<name>qdš</name>
<w>
  <seg>ʾa</seg>
  <unclear>b̊</unclear>
</w>

Expected result:

<w correspUnic='&#66433;&#66448;'>bn</w>
<g correspUnic='&#66463;'>.</g>
<name correspUnic='&#66454;&#66436;&#66444;'>qdš</name>
<w>
  <seg correspUnic='&#x10380;'>ʾa</seg>
  <unclear correspUnic='&#66433;'>b̊</unclear>
</w>

I have tried several variants of an XSL transformation file, but I confess that, after several hours, I am close to giving up. Here is the latest code, which sadly doesn't work:

<!-- Define the str-split function -->
   <xsl:template name="str-split">
      <xsl:param name="input" />
      <xsl:param name="delimiter" select="''" />
      <xsl:choose>
         <xsl:when test="contains($input, $delimiter)">
            <xsl:variable name="first" select="substring-before($input, $delimiter)" />
            <xsl:variable name="rest" select="substring-after($input, $delimiter)" />
            <char>
               <xsl:value-of select="$first" />
            </char>
            <xsl:call-template name="str-split">
               <xsl:with-param name="input" select="$rest" />
               <xsl:with-param name="delimiter" select="$delimiter" />
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <char>
               <xsl:value-of select="$input" />
            </char>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
   
   <!-- Define Unicode data directly in the variable -->
   <xsl:variable name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
       <!-- etc -->
      </data>
   </xsl:variable>
   
   <xsl:template match="/">
      <!-- Display the value of the variable $unicodeData -->
      <xsl:message select="$unicodeData" />
      
      <xsl:apply-templates/>
   </xsl:template>

   
   <!-- XSLT template for adding @correspUnic to w, g, unclear, name, seg, and supplied -->
   <xsl:template match="w | g | unclear | name | seg | supplied">
      <!-- Copy current element -->
      <xsl:copy>
         <!-- Apply rules to add @correspUnic to children -->
         <xsl:apply-templates select="node()" />
         <!-- Check whether the current element must have @correspUnic -->
         <xsl:if test="self::name or self::seg or self::supplied or self::w or self::g or self::unclear">
            <!-- Recover Latinized characters from textual descendants -->
            <xsl:variable name="latinized">
               <xsl:for-each select="descendant::text()">
                  <xsl:value-of select="." />
               </xsl:for-each>
            </xsl:variable>
            <!-- Check if Latinized characters are detected -->
            <xsl:if test="normalize-space($latinized)">
               <!-- Use the str-split function to split the string -->
               <xsl:variable name="correspUnicode">
                  <xsl:call-template name="str-split">
                     <xsl:with-param name="input" select="$latinized" />
                  </xsl:call-template>
               </xsl:variable>
               <!-- Add @correspUnic attribute with Unicode values -->
               <xsl:attribute name="correspUnic">
                  <xsl:for-each select="$correspUnicode/char">
                     <xsl:variable name="char" select="." />
                     <xsl:if test="normalize-space($char)">
                        <xsl:value-of select="concat('&amp;#', $unicodeData//row[latin = $char]/Unicode, ';')" />
                     </xsl:if>
                  </xsl:for-each>
               </xsl:attribute>
            </xsl:if>
         </xsl:if>
      </xsl:copy>
   </xsl:template>

As you can see, I added xsl:message to surface any errors that would directly affect the addition of the attribute and its values, but nothing shows up.

Thank you very much in advance for your advice and suggestions.

Vanessa On BEST ANSWER

Thanks to Martin, who helped me solve the problem of displaying the @correspUnic values. However, there was still a problem with the Unicode decimal values of ʾa (66432), ʾi (66459), and ʾu (66460): each was treated as two characters, whereas in Ugaritic each is a single glyph. To get around this, I used a regex. I then needed some additional processing to output & rather than &amp;, which was not simple, given that & is by default understood as the start of an entity. I'm not saying it is the best solution, but it works.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   exclude-result-prefixes="#all"
   version="3.0">
   
   <xsl:mode on-no-match="shallow-copy"/>
   
   <!-- Define Unicode data directly in the variable -->
   <xsl:param name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
         <row>
            <latin>h</latin>
            <Unicode>66437</Unicode>
         </row>
         <!-- etc -->
      </data>
   </xsl:param>
   
   <xsl:key name="latin-to-unicode" match="row" use="latin"/>
   
   <xsl:character-map name="ugaritic">
      <xsl:output-character character="&#66432;" string="&amp;#66432;"/>
      <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
      <xsl:output-character character="&#66434;" string="&amp;#66434;"/>
      <xsl:output-character character="&#66435;" string="&amp;#66435;"/>
      <xsl:output-character character="&#66436;" string="&amp;#66436;"/>
      <xsl:output-character character="&#66437;" string="&amp;#66437;"/>
      <!-- etc -->
   </xsl:character-map>

   <xsl:output method="xml" use-character-maps="ugaritic"/>

   <!-- For example: add @correspUnic only to w elements whose text does not come from unclear, seg, or supplied child elements -->
   <xsl:template match="w[(not(child::unclear) and not(child::seg) and not(child::supplied)) and text() and (not(@correspUnic) or string-length(normalize-space(@correspUnic)) = 0)]">
      <xsl:copy>
         <xsl:apply-templates select="@*"/>
         <xsl:attribute name="correspUnic">
            <xsl:apply-templates select="text()" mode="map"/>
         </xsl:attribute>
         <xsl:apply-templates/>
      </xsl:copy>
   </xsl:template>

   <xsl:template match="text()" mode="map">
      <xsl:analyze-string select="." regex="ʾ[aiu]">
         <xsl:matching-substring>
            <xsl:variable name="matchedChar" select="." />
            <xsl:variable name="unicodeValue">
               <xsl:choose>
                  <xsl:when test="$matchedChar = 'ʾa'">66432</xsl:when>
                  <xsl:when test="$matchedChar = 'ʾi'">66459</xsl:when>
                  <xsl:when test="$matchedChar = 'ʾu'">66460</xsl:when>
               </xsl:choose>
            </xsl:variable>
            <!-- Create a Unicode string at once -->
            <xsl:variable name="unicodeString" select="codepoints-to-string($unicodeValue)"/>
            <!-- remove all &amp; -->
            <xsl:variable name="cleanedString" select="replace($unicodeString, '&amp;', '')"/>
            <xsl:sequence select="$cleanedString"/>
         </xsl:matching-substring>
         <xsl:non-matching-substring>
            <xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
               <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
            </xsl:for-each>
         </xsl:non-matching-substring>
      </xsl:analyze-string>
   </xsl:template>
   
   
</xsl:stylesheet>

Martin Honnen On

Perhaps the following helps, though I have not quite understood the whole lot of characters used:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>
  
  <!-- Define Unicode data directly in the variable -->
  <xsl:param name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>n</latin>
            <Unicode>66448</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
       <!-- etc -->
      </data>
  </xsl:param>
   
  <xsl:key name="latin-to-unicode" match="row" use="latin"/>
  
  <xsl:character-map name="ugaritic">
    <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
    <xsl:output-character character="&#66448;" string="&amp;#66448;"/>
    <!-- ... -->
  </xsl:character-map>

  <xsl:output method="xml" use-character-maps="ugaritic"/>

  <xsl:template match="*[text()[normalize-space()]]">
    <xsl:copy>
      <xsl:attribute name="correspUnic">
        <xsl:apply-templates select="text()" mode="map"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="text()" mode="map">
    <xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
      <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
    </xsl:for-each>
  </xsl:template>
  
</xsl:stylesheet>

Transforms <w>bn</w> into <w correspUnic="&#66433;&#66448;">bn</w>.
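
The two-character transliterations ʾa, ʾi and ʾu could also be looked up in the same table instead of being special-cased. The following is only a minimal sketch, assuming the table carries a row for each of the three (the values 66459 and 66460 are the ones quoted in the accepted answer). The shortened table and the regex `ʾ[aiu]|.` are illustrative choices, not part of the original stylesheet. The regex tries the digraphs first and otherwise falls back to a single character, so every token goes through the same key lookup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Shortened, illustrative table; the digraphs are ordinary rows -->
  <xsl:param name="unicodeData">
    <data>
      <row><latin>ʾa</latin><Unicode>66432</Unicode></row>
      <row><latin>ʾi</latin><Unicode>66459</Unicode></row>
      <row><latin>ʾu</latin><Unicode>66460</Unicode></row>
      <row><latin>b</latin><Unicode>66433</Unicode></row>
      <row><latin>n</latin><Unicode>66448</Unicode></row>
    </data>
  </xsl:param>

  <xsl:key name="latin-to-unicode" match="row" use="latin"/>

  <!-- Write each mapped glyph as a decimal character reference -->
  <xsl:character-map name="ugaritic">
    <xsl:output-character character="&#66432;" string="&amp;#66432;"/>
    <xsl:output-character character="&#66459;" string="&amp;#66459;"/>
    <xsl:output-character character="&#66460;" string="&amp;#66460;"/>
    <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
    <xsl:output-character character="&#66448;" string="&amp;#66448;"/>
  </xsl:character-map>

  <xsl:output method="xml" use-character-maps="ugaritic"/>

  <xsl:template match="*[text()[normalize-space()]]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="correspUnic">
        <xsl:apply-templates select="text()" mode="map"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="map">
    <!-- Tokenize: a digraph (ʾa, ʾi, ʾu) if present, else one character -->
    <xsl:analyze-string select="." regex="ʾ[aiu]|.">
      <xsl:matching-substring>
        <!-- Tokens without a row in the table produce an empty string -->
        <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this, <seg>ʾa</seg> should get a single &#66432;, and characters that have no row, such as the combining ring in b̊, contribute nothing to the attribute.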

y.arazim On

Using the transliteration table from here, I came up with the following code (requires XSLT 2.0):

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="ASCII" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:variable name="latin">abgḫdhwzḥṭykšlmḏnẓspṣqrṯġtiuSʾ</xsl:variable>
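<xsl:variable name="ugaritic">&#x10380;&#x10381;&#x10382;&#x10383;&#x10384;&#x10385;&#x10386;&#x10387;&#x10388;&#x10389;&#x1038A;&#x1038B;&#x1038C;&#x1038D;&#x1038E;&#x1038F;&#x10390;&#x10391;&#x10392;&#x10393;&#x10394;&#x10395;&#x10396;&#x10397;&#x10398;&#x10399;&#x1039A;&#x1039B;&#x1039C;&#x1039D;</xsl:variable>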

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="w[text()] | g[text()] | name[text()] | seg[text()]">
    <xsl:variable name="adjusted" select="replace(., 's2', 'S')" />
    <xsl:copy>
        <xsl:attribute name="correspUnic">
            <xsl:value-of select="translate($adjusted, $latin, $ugaritic)" />
        </xsl:attribute>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

(I set the output encoding to ASCII just to be able to recognize the output characters).

Applying this to the following XML:

<root>
    <w>bn</w>
    <g>.</g>
    <name>qdš</name>
    <w>
      <seg>ʾa</seg>
      <unclear>b̊</unclear>
    </w>
</root>

I get:

<?xml version="1.0" encoding="ASCII"?>
<root>
   <w correspUnic="&#x10381;&#x10390;">bn</w>
   <g correspUnic=".">.</g>
   <name correspUnic="&#x10395;&#x10384;&#x1038c;">qd&#x161;</name>
   <w>
      <seg correspUnic="&#x1039d;&#x10380;">&#x2be;a</seg>
      <unclear>b&#x30a;</unclear>
   </w>
</root>

Apparently you have some more entries in your transliteration table, but that should be a very simple modification.
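
As a rough sketch of such a modification, assuming the word divider `.` should map to U+1039F (decimal 66463, the value the question expects on `<g>`): because `translate()` pairs the two strings position by position, a new entry is added by appending the Latin character to `$latin` and its counterpart at the same position in `$ugaritic`. The three-entry table below is deliberately shortened for illustration and is not the full transliteration table:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="ASCII" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Illustrative table: b, n and the word divider only.
       The character at position i of $latin maps to position i of $ugaritic. -->
  <xsl:variable name="latin">bn.</xsl:variable>
  <xsl:variable name="ugaritic">&#x10381;&#x10390;&#x1039F;</xsl:variable>

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add @correspUnic to w and g elements that contain text -->
  <xsl:template match="w[text()] | g[text()]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="correspUnic">
        <xsl:value-of select="translate(., $latin, $ugaritic)"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that `translate()` only maps single characters, so a two-character transliteration would first need to be collapsed into a placeholder, in the same way `s2` is replaced by `S` above.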