Parsing multiple elements with multiple tags from XML file - edited request

Question

96 Views Asked by Gregg Lund At 28 July 2025 at 08:07

I have retrieved descriptions of publications from PubMed through its E-utilities using Python. Each article has many characteristics; for each article/author combination I want to extract the "Id", "Author", "Title" and "DOI" to put into a database for analysis.

Python code I am using:

# Proof of concept

import requests
from bs4 import BeautifulSoup

# Step 1 - Prepare names to search
# list of members' full names
members_fullname = ['Downs,Stephen M', 'Delaney, Brendan C', 'Grout, Randall W', 'Michaud, Kaleb']

# takes member names and concatenates them to a string
members_concat = ' OR '.join([m+"[Full Author Name]" for m in members_fullname])

# Step 2 - prepares PubMed Entrez "esearch" query
search_term= members_concat
esearch_ids = []
esearch_request = requests.get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",params={'db':'pubmed',"term":search_term,'cmd':"DetailsSearch","reldate":90})
i_esearch_ids = [i.text for i in BeautifulSoup(esearch_request.text,'lxml-xml').find_all('Id')]
i_esearch_ids += i_esearch_ids

# Step 3 - Takes the PubMed esearch results (i_esearch_ids) and submits a PubMed Entrez "esummary" request
ids_concat = i_esearch_ids
search_ids = ",".join(ids_concat)
esummary_request = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",params={'db':'pubmed','id':search_ids})

# Step 4 Parses esummary results to have id, Author, Title and DOI on single line
from lxml import etree
docs = """esummary_request"""
doc = etree.XML(docs)

for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)

This gives me an error on the last step:

  File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 29, in
    doc = etree.XML(docs)
  File "src/lxml/etree.pyx", line 3236, in lxml.etree.XML
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

There is something fundamental that I am missing (I am new to Python).

Thanks to John and Giles.

**Jack Fleeting** · Answer 1

Since you are dealing with an XML file, you should use an XML parser, like lxml, coupled with XPath.

In this case, it can be as simple as:

from lxml import etree
docs = """[your xml above]"""
doc = etree.XML(docs)

for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)

Output:

36762609 , Perspectives on ...and Tumor Necrosis Factor ...Study. , ['Ogdie A'...'Michaud K']
36706947 , Impact of key ...United States and Europe. , ['Walsh JA', ... 'Dennis N', 'Gossec L']
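
Regarding the error in the edited question: `docs = """esummary_request"""` is the literal text `esummary_request`, not the response, so the parser finds no `<` at line 1, column 1. Passing the response body (for example `esummary_request.content`) should avoid that. Below is a minimal sketch of getting one row per article/author combination, including the DOI. It uses the standard-library `xml.etree.ElementTree` in place of lxml so that it runs without extra packages. The function name `docsum_rows`, the sample document, and the assumption that the DOI sits in a top-level `Item` named `DOI` are mine, so check them against a real esummary reply.

```python
import xml.etree.ElementTree as ET


def docsum_rows(xml_bytes):
    """Return one (id, author, title, doi) tuple per article/author pair."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for ds in root.iter("DocSum"):
        pmid = ds.findtext("Id", default="")
        title = ds.findtext("Item[@Name='Title']", default="")
        # Assumption: the DOI is a direct child Item named "DOI"; it may be absent.
        doi = ds.findtext("Item[@Name='DOI']", default="")
        # Author items are nested inside the AuthorList item.
        authors = [a.text for a in ds.findall(".//Item[@Name='Author']")]
        for author in authors:
            rows.append((pmid, author, title, doi))
    return rows


# Made-up placeholder in the shape of an esummary reply (not real PubMed data).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>111</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Author A</Item>
      <Item Name="Author" Type="String">Author B</Item>
    </Item>
    <Item Name="Title" Type="String">Example title one.</Item>
    <Item Name="DOI" Type="String">10.0000/example.1</Item>
  </DocSum>
  <DocSum>
    <Id>222</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Author C</Item>
    </Item>
    <Item Name="Title" Type="String">Example title two.</Item>
  </DocSum>
</eSummaryResult>
"""

rows = docsum_rows(sample)
for row in rows:
    print(row)

# With the request from the question, the call would be:
# rows = docsum_rows(esummary_request.content)
```

Each tuple can then be inserted into a database table as a single row; articles without a DOI get an empty string in that column.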
