Converting .tei file to a .txt file

I have a .tei file of the following format.

<biblStruct xml:id="b0">
    <analytic>
        <title level="a" type="main">The Semantic Web</title>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">T</forename>
                <surname>Berners-Lee</surname>
            </persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">J</forename>
                <surname>Hendler</surname>
            </persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">O</forename>
                <surname>Lassilia</surname>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">Scientific American</title>
        <imprint>
            <date type="published" when="2001-05" />
        </imprint>
    </monogr>
</biblStruct>

I want to convert the above file to a .txt file that looks like this:

T. Berners-Lee, J. Hendler and O. Lassilia. ‘The Semantic Web’, Scientific American, May 2001

I tried using the following piece of code:

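import xml.etree.ElementTree as ET

tree = ET.parse(path)  # path: location of the .tei file
root = tree.getroot()
s = ""
for childs in root:
    for child in childs:
        # child.text is None for elements that only contain other elements
        s = s + (child.text or "")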

The problem with the above code is that the loop visits the elements in document order, so the pieces of the resulting string are not in the order I need.

Secondly, the elements may be nested more deeply than two levels. Extracting text from the deeper levels without manually checking each one is also problematic. Please help me with this.

There is 1 solution below

I know that you're looking for a Python solution, but because XSLT is such a convenient alternative and a natural fit for an XML file, I'm posting an XSLT solution anyway.

It should be easy to integrate into your Python code. Note that the standard-library xml.etree.ElementTree module has no XSLT support, whereas lxml can apply an XSLT 1.0 stylesheet.
This is the necessary XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:month="http://month.com">
    <xsl:output method="text" />
    <xsl:strip-space elements="*" />

    <month:month>
        <month name="Jan" />
        <month name="Feb" />
        <month name="Mar" />
        <month name="Apr" />
        <month name="May" />
        <month name="Jun" />
        <month name="Jul" />
        <month name="Aug" />
        <month name="Sep" />
        <month name="Oct" />
        <month name="Nov" />
        <month name="Dec" />
    </month:month>

    <xsl:template match="author[position()=1]">
        <xsl:value-of select="concat(tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>    

    <xsl:template match="author">
        <xsl:value-of select="concat(', ',tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>

    <xsl:template match="author[last()]">
        <xsl:value-of select="concat(' and ',tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>

    <xsl:template match="/biblStruct">
        <xsl:apply-templates select="analytic/author" />
        <xsl:variable name="mon" select="number(substring(monogr/imprint/date/@when,6,2))" />
        <!-- full stop after the authors, then the quoted article title and the journal title -->
        <xsl:value-of select='concat(". &apos;",analytic/title,"&apos;",", ",monogr/title, ", ")' />
        <xsl:value-of select="document('')/xsl:stylesheet/month:month/month[$mon]/@name" />
        <!-- year: the first four characters of @when -->
        <xsl:value-of select="concat(' ',substring(monogr/imprint/date/@when,1,4))" />
    </xsl:template>

</xsl:stylesheet>

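You don't have to know much about XSLT to understand this code:
There are three templates matching author elements: one for the first author, one for the last() author, and one for all authors in between. They differ only in the separator they emit: nothing before the first author, ", " before the ones in between, and " and " before the last. The two patterns with a predicate have a higher default priority than the plain author pattern, so the generic template only applies to the authors in between.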
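
For readers who prefer Python, the separator logic of the three author templates can be sketched as follows. This is only an illustration: join_authors is a hypothetical helper name, and unlike the stylesheet it treats a lone author as the first one and emits no separator.

```python
def join_authors(names):
    """Join names as 'A, B and C', mirroring the three author templates."""
    parts = []
    last = len(names) - 1
    for i, name in enumerate(names):
        if i == 0:
            parts.append(name)            # first author: no separator
        elif i == last:
            parts.append(" and " + name)  # last author
        else:
            parts.append(", " + name)     # authors in between
    return "".join(parts)


print(join_authors(["T. Berners-Lee", "J. Hendler", "O. Lassilia"]))
```

The names are passed in already formatted as "initial. surname", which is what each template produces with concat().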

The last template matches the root biblStruct element and combines the output of the other three templates with the titles and the date. It also converts the numeric month into its name by looking it up by position in the month:month data island, which is read from the stylesheet itself via document('').
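
The month lookup can be sketched in Python as well. This is a minimal sketch that assumes the when attribute always has the form YYYY-MM or YYYY, as in the sample; MONTHS and format_when are hypothetical names chosen for the illustration.

```python
# Same role as the month:month data island: position in the list = month number.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_when(when):
    """Turn a TEI date value such as '2001-05' into 'May 2001'."""
    year = when[:4]
    month = when[5:7]          # characters 6-7, like substring(@when, 6, 2)
    if not month:
        return year            # only a year was given
    return f"{MONTHS[int(month) - 1]} {year}"


print(format_when("2001-05"))
```

The list index is the month number minus one because Python lists are zero-based, whereas the XPath predicate month[$mon] counts from one.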

You should also look at the defined namespaces of the xsl:stylesheet element:

  • One for XSL : http://www.w3.org/1999/XSL/Transform
  • One for TEI : http://www.tei-c.org/ns/1.0
  • One for month: http://month.com for the data island
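
The TEI namespace matters in a pure Python approach too, because persName and its children are declared in it while the surrounding elements of the sample are not. The following sketch uses the standard-library xml.etree.ElementTree with a prefix map; NS, SAMPLE and author_names are hypothetical names, and the sample is a shortened copy of the question's input. In a complete TEI file the outer elements are usually in the TEI namespace as well, so the paths would need the prefix on every step.

```python
import xml.etree.ElementTree as ET

# Prefix map for ElementTree paths; the prefix name "tei" is arbitrary.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<analytic>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">T</forename>
      <surname>Berners-Lee</surname>
    </persName>
  </author>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">J</forename>
      <surname>Hendler</surname>
    </persName>
  </author>
</analytic>"""


def author_names(analytic):
    """Return each author as 'forename. surname'."""
    names = []
    for pers in analytic.findall("author/tei:persName", NS):
        forename = pers.findtext("tei:forename", default="", namespaces=NS)
        surname = pers.findtext("tei:surname", default="", namespaces=NS)
        names.append(f"{forename}. {surname}")
    return names


root = ET.fromstring(SAMPLE)
print(author_names(root))
# Without the prefix nothing is found, because persName is in the TEI namespace.
print(root.findall("author/persName"))
```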

I hope that I have made a convincing case for using an XSLT file to do the transformation. The xsl:output element with method="text" makes the stylesheet produce plain text rather than XML.