How to find a specific tag in an XML file and then access its parent tag with Python and minidom

I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.

My XML file is in this format:

<root>
 <article>
  <number>
   0 
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6 
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery. 
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  ...All the article tags as shown above...
 </article>
</root>

I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then for me to be able to access the <title> and <abstract> tags within the corresponding <article> tag.

So far I've tried to adapt code from this question:

from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)   

#looking for: 10.1016/B978-0-12-381015-1.00004-6

matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']

for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI

But I'm not entirely sure what I'm doing!

Thanks for any help.

There are 2 best solutions below

Best answer

IMHO, just look it up in the Python docs! Try this (not tested):

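from xml.dom import minidom

# Path taken from the question
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)

def get_xmltext(parent, subnode_name):
    # Serialise the children of the first matching subnode and trim the
    # whitespace that surrounds the text in this file
    node = parent.getElementsByTagName(subnode_name)[0]
    return "".join([ch.toxml() for ch in node.childNodes]).strip()

matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
                 if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']

for node in matchingNodes:
    print("title: " + get_xmltext(node, "title"))
    print("abstract: " + get_xmltext(node, "abstract"))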
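
Since the question title asks about reaching the parent tag, here is a minimal sketch of that route: find the matching <DOI> node first, then step up with parentNode to the enclosing <article>. The SAMPLE string and the helper names node_text and find_article are made up for illustration, and the sketch assumes each <DOI> sits directly inside its <article>.

```python
from xml.dom import minidom

# Hypothetical sample data in the same shape as the question's file.
SAMPLE = """<root>
 <article>
  <number>
   0
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery.
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
</root>"""


def node_text(node):
    """Join the text children of a node and trim surrounding whitespace."""
    return "".join(ch.data for ch in node.childNodes
                   if ch.nodeType == ch.TEXT_NODE).strip()


def find_article(doc, doi):
    """Return the <article> element enclosing the given DOI, or None."""
    for doi_node in doc.getElementsByTagName("DOI"):
        if node_text(doi_node) == doi:
            # Assumes <DOI> is a direct child of <article>.
            return doi_node.parentNode
    return None


doc = minidom.parseString(SAMPLE)
article = find_article(doc, "10.1016/B978-0-12-381015-1.00004-6")
if article is not None:
    print("title: " + node_text(article.getElementsByTagName("title")[0]))
    print("abstract: " + node_text(article.getElementsByTagName("abstract")[0]))
```

To run this against the real file, swap minidom.parseString(SAMPLE) for minidom.parse(datasource).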

Is minidom a requirement? It would be quite easy to parse it with lxml and XPath.

from lxml import etree

tree = etree.parse('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
# normalize-space() trims the whitespace around the DOI text before comparing
path = tree.xpath("//article[normalize-space(DOI)='10.1016/B978-0-12-381015-1.00004-6']")

This will get you a list of the <article> elements with the DOI specified.

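Also, it seems that there is whitespace around the text inside the tags. I dunno if this is because of the Stack Overflow formatting or not. If the whitespace is really in the file, it is probably why you cannot match the DOI with minidom: the text node's value includes the surrounding newlines and spaces, so an exact string comparison fails.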
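
To make the whitespace point concrete, here is a small sketch using only the standard library. The inline XML string is a made-up one-article snippet laid out like the question's file, not the real data.

```python
from xml.dom import minidom

# Hypothetical snippet: the DOI text is padded with a newline and spaces,
# as in the question's listing.
doc = minidom.parseString(
    "<root><article><DOI>\n   10.1016/B978-0-12-381015-1.00004-6 \n  </DOI></article></root>"
)
raw = doc.getElementsByTagName("DOI")[0].firstChild.nodeValue
target = "10.1016/B978-0-12-381015-1.00004-6"

print(repr(raw))              # shows the padding around the DOI
print(raw == target)          # False: the padding is part of the value
print(raw.strip() == target)  # True once the padding is stripped
```

So calling .strip() on the node value before comparing should be enough to make the minidom version match.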