I have retrieved descriptions of publications from PubMed through its E-utilities, using Python. Each article has many characteristics; for each article/author combination I want to extract the "Id", "Author", "Title" and "DOI" to put into a database for analysis.
Python code I am using:
# Proof of concept
import requests
from bs4 import BeautifulSoup
# Step 1 - Prepare names to search
# list of members' full names
members_fullname = ['Downs,Stephen M', 'Delaney, Brendan C', 'Grout, Randall W', 'Michaud, Kaleb']
# takes member names and concatenates them to a string
members_concat = ' OR '.join([m+"[Full Author Name]" for m in members_fullname])
# Step 2 - prepares PubMed Entrez "esearch" query
search_term= members_concat
esearch_ids = []
esearch_request = requests.get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",params={'db':'pubmed',"term":search_term,'cmd':"DetailsSearch","reldate":90})
i_esearch_ids = [i.text for i in BeautifulSoup(esearch_request.text,'lxml-xml').find_all('Id')]
esearch_ids += i_esearch_ids
# Step 3 - Takes the results of the Pubmed esearch results (esearch_ids) and submits PubMed Entrez "esummary" request
ids_concat = esearch_ids
search_ids = ",".join(ids_concat)
esummary_request = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",params={'db':'pubmed','id':search_ids})
# Step 4 Parses esummary results to have id, Author, Title and DOI on single line
from lxml import etree
docs = """esummary_request"""
doc = etree.XML(docs)
for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)
This gives me an error on the last step:
File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 29, in doc = etree.XML(docs)
File "src/lxml/etree.pyx", line 3236, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
There is something fundamental that I am missing (I am new to Python).
Thanks to John and Giles.
Since you are dealing with an xml file, you should use an xml parser, like lxml, coupled with xpath. In this case, it can be as simple as:
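A minimal sketch of that approach, not a definitive implementation. The error in the question arises because docs = """esummary_request""" is the literal text esummary_request rather than the response body, so the parser never sees a '<'. The example below parses a hand-written sample instead of a live response so that it runs offline. The sample XML, the "DOI" item name and the helper name parse_docsums are assumptions for illustration; with a real response, pass esummary_request.content (bytes) to the helper instead.

```python
from lxml import etree

# Hand-written sample in the DocSum/Item layout that the question's XPath
# expressions assume. It is NOT real PubMed data; the "DOI" item name is an
# assumption about the esummary response.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>111</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Doe J</Item>
      <Item Name="Author" Type="String">Roe R</Item>
    </Item>
    <Item Name="Title" Type="String">First example title</Item>
    <Item Name="DOI" Type="String">10.1000/example.1</Item>
  </DocSum>
  <DocSum>
    <Id>222</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Doe J</Item>
    </Item>
    <Item Name="Title" Type="String">Second example title</Item>
  </DocSum>
</eSummaryResult>"""


def parse_docsums(xml_bytes):
    """Return one (id, author, title, doi) tuple per article/author pair."""
    # Parse the actual XML bytes, not a string holding a variable's name.
    doc = etree.XML(xml_bytes)
    rows = []
    for ds in doc.xpath('//DocSum'):
        pmid = ds.xpath('./Id/text()')[0]
        # ''.join(...) yields an empty string when the item is absent,
        # instead of raising IndexError the way [0] would.
        title = ''.join(ds.xpath('.//Item[@Name="Title"]/text()'))
        doi = ''.join(ds.xpath('.//Item[@Name="DOI"]/text()'))
        # One row per author gives the article/author combinations.
        for author in ds.xpath('.//Item[@Name="Author"]/text()'):
            rows.append((pmid, author, title, doi))
    return rows


for row in parse_docsums(SAMPLE_XML):
    print(row)
```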
Output: