I have a .tei file of the following format:
<biblStruct xml:id="b0">
<analytic>
<title level="a" type="main">The Semantic Web</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">T</forename>
<surname>Berners-Lee</surname>
</persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">J</forename>
<surname>Hendler</surname>
</persName>
</author>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">O</forename>
<surname>Lassilia</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Scientific American</title>
<imprint>
<date type="published" when="2001-05" />
</imprint>
</monogr>
</biblStruct>
I want to convert the above file to .txt format, which looks like this:
T. Berners-Lee, J. Hendler and O. Lassilia. ‘The Semantic Web’, Scientific American,May 2001
I tried using the following piece of code:
import xml.etree.ElementTree as ET

tree = ET.parse(path)
root = tree.getroot()
s = ""
for childs in root:
    for child in childs:
        s = s + child.text
The problem with the above code is that the loop simply visits the elements in document order, so the resulting string does not come out in the format I want.
Secondly, the elements may be nested more deeply, and extracting text from the inner levels without checking each one by hand is also a problem. Please help me with this.
I know that you're looking for a Python solution, but because XSLT is such a convenient alternative and a perfect fit for an .xml file, I'm posting an XSLT solution anyway. I guess it can be easily integrated into your Python solution.
So this is the necessary XSLT:
You don't have to know much about XSLT to understand this code:
There are three templates matching author elements: one matching the first match, one matching the last() match, and one matching all in between. They differ only in how they handle the separators (",", "and" and ".").
The last template handles the whole XML and combines the output of the other three templates. It also transforms the numerical month number to a string by referencing the month:month data island.
You should also look at the namespaces defined on the xsl:stylesheet element:
http://www.w3.org/1999/XSL/Transform (the XSLT namespace)
http://www.tei-c.org/ns/1.0 (the TEI namespace)
http://month.com for the data island
I hope that I have made a convincing case for using an XSLT file to do the transformation. The xsl:output element specifies the desired text output with method="text".
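If you would rather stay in pure Python, the same idea (separate handling of the first, in-between and last author, plus a lookup from month number to month name) can be sketched with ElementTree from the standard library. This is only a rough sketch, not a translation of the stylesheet: the helper names (format_author, join_authors, format_date, format_bibl) are made up for illustration, and it assumes the structure of the sample in the question, where only persName and its children are in the TEI namespace and biblStruct, author, title and date carry no namespace. It also puts a space after the last comma.

```python
import xml.etree.ElementTree as ET

# Assumption: only persName and its children are in the TEI namespace,
# exactly as in the sample from the question.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

SAMPLE = """<biblStruct xml:id="b0">
<analytic>
<title level="a" type="main">The Semantic Web</title>
<author><persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">T</forename><surname>Berners-Lee</surname>
</persName></author>
<author><persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">J</forename><surname>Hendler</surname>
</persName></author>
<author><persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">O</forename><surname>Lassilia</surname>
</persName></author>
</analytic>
<monogr>
<title level="j">Scientific American</title>
<imprint><date type="published" when="2001-05" /></imprint>
</monogr>
</biblStruct>"""


def format_author(author):
    """Return e.g. 'T. Berners-Lee' for one <author> element."""
    forename = author.findtext("tei:persName/tei:forename", default="", namespaces=NS)
    surname = author.findtext("tei:persName/tei:surname", default="", namespaces=NS)
    return f"{forename}. {surname}" if forename else surname


def join_authors(names):
    """Join names as 'A, B and C'; the last name gets 'and' instead of ','."""
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def format_date(when):
    """Turn '2001-05' into 'May 2001'; return the raw value if no valid month."""
    parts = when.split("-")
    if len(parts) >= 2 and parts[1].isdigit() and 1 <= int(parts[1]) <= 12:
        return f"{MONTHS[int(parts[1]) - 1]} {parts[0]}"
    return when


def format_bibl(bibl):
    """Build one citation line from a <biblStruct> element."""
    authors = join_authors([format_author(a) for a in bibl.findall("analytic/author")])
    title = bibl.findtext("analytic/title", default="")
    journal = bibl.findtext("monogr/title", default="")
    date = bibl.find("monogr/imprint/date")
    when = format_date(date.get("when", "")) if date is not None else ""
    return f"{authors}. \u2018{title}\u2019, {journal}, {when}"


print(format_bibl(ET.fromstring(SAMPLE)))
```

For a whole file you would parse it with ET.parse(path) and call format_bibl on every biblStruct element instead of on the sample string. If the real file declares the TEI namespace on the root element, the un-prefixed paths above would need the tei: prefix as well.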