Parsing multiple elements with multiple tags from XML file - edited request

I have retrieved descriptions of publications from PubMed through its E-utilities using Python. Each article has many characteristics; for each article/author combination I want to extract the "Id", "Author", "Title" and "DOI" to put into a database for analysis.

Python code I am using:

# Proof of concept

import requests
from bs4 import BeautifulSoup

# Step 1 - Prepare names to search
# list of members' full names
members_fullname = ['Downs,Stephen M', 'Delaney, Brendan C', 'Grout, Randall W', 'Michaud, Kaleb']

# takes member names and concatenates them to a string
members_concat = ' OR '.join([m+"[Full Author Name]" for m in members_fullname])

# Step 2 - prepares PubMed Entrez "esearch" query
search_term= members_concat
esearch_ids = []
esearch_request = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",params={'db':'pubmed',"term":search_term,'cmd':"DetailsSearch","reldate":90})
i_esearch_ids = [i.text for i in BeautifulSoup(esearch_request.text,'lxml-xml').find_all('Id')]
# accumulate into esearch_ids (adding the list to itself would duplicate every id)
esearch_ids += i_esearch_ids

# Step 3 - takes the PubMed esearch results (i_esearch_ids) and submits a PubMed Entrez "esummary" request
ids_concat = i_esearch_ids
search_ids = ",".join(ids_concat)
esummary_request = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",params={'db':'pubmed','id':search_ids})

# Step 4 - parses esummary results to put Id, Author, Title and DOI on a single line
from lxml import etree
docs = """esummary_request"""
doc = etree.XML(docs)

for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)

This gives me an error on the last step:

File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 29, in doc = etree.XML(docs)

File "src/lxml/etree.pyx", line 3236, in lxml.etree.XML

File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument

File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc

File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc

File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc

File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult

File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError

File "", line 1 XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

There is something fundamental that I am missing (I am new to Python).

Thanks to John and Giles.

There is 1 best solution below

Since you are dealing with an XML file, you should use an XML parser such as lxml, coupled with XPath.

In this case, it can be as simple as:

from lxml import etree
docs = """[your xml above]"""
doc = etree.XML(docs)

for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)

Output:

36762609 , Perspectives on ...and Tumor Necrosis Factor ...Study. , ['Ogdie A'...'Michaud K']
36706947 , Impact of key ...United States and Europe. , ['Walsh JA', ... 'Dennis N', 'Gossec L']
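
A note on the traceback in the edited question: docs = """esummary_request""" creates a string containing the literal text "esummary_request", not the response, so the parser sees a document that does not start with '<'. Passing the response body (for example esummary_request.content) to the parser should avoid that. Below is a minimal sketch of one way to get a row per article/author combination, including the DOI. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The sample XML is made up, the helper names first_text and docsum_rows are hypothetical, and the assumption that the DOI sits in an Item named "DOI" directly under each DocSum should be checked against your actual response.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like an esummary response (assumed layout, not real data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>111</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Doe J</Item>
      <Item Name="Author" Type="String">Roe R</Item>
    </Item>
    <Item Name="Title" Type="String">Example title.</Item>
    <Item Name="DOI" Type="String">10.1000/example</Item>
  </DocSum>
  <DocSum>
    <Id>222</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Poe P</Item>
    </Item>
    <Item Name="Title" Type="String">Another title.</Item>
  </DocSum>
</eSummaryResult>"""


def first_text(node, path):
    """Return the text of the first match for path, or None if there is none."""
    found = node.find(path)
    return found.text if found is not None else None


def docsum_rows(xml_bytes):
    """Return one (id, author, title, doi) tuple per article/author pair."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for ds in root.iter('DocSum'):
        pmid = first_text(ds, 'Id')
        title = first_text(ds, "Item[@Name='Title']")
        # Assumption: the DOI is a direct child Item named "DOI"; it may be absent.
        doi = first_text(ds, "Item[@Name='DOI']")
        for author in ds.findall(".//Item[@Name='Author']"):
            rows.append((pmid, author.text, title, doi))
    return rows


rows = docsum_rows(SAMPLE)
for row in rows:
    print(row)

# With the live request from the question, pass the response body instead:
# rows = docsum_rows(esummary_request.content)
```

Returning None for a missing DOI, rather than indexing [0] on an empty result, keeps one incomplete record from stopping the whole loop. The tuples can then be written to a database or a CSV file as they are.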