I have retrieved descriptions of publications from PubMed through its E-utilities using Python. Each article has many characteristics; for each article/author combination I want to extract the "Id", "Author", "Title" and "DOI" to put into a database for analysis.
Python code I am using:
# Proof of concept
import requests
from bs4 import BeautifulSoup
# Step 1 - Prepare names to search
# list of members' full names
members_fullname = ['Downs,Stephen M', 'Delaney, Brendan C', 'Grout, Randall W', 'Michaud, Kaleb']
# takes member names and concatenates them to a string
members_concat = ' OR '.join([m+"[Full Author Name]" for m in members_fullname])
# Step 2 - prepares PubMed Entrez "esearch" query
search_term= members_concat
esearch_ids = []
esearch_request = requests.get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",params={'db':'pubmed',"term":search_term,'cmd':"DetailsSearch","reldate":90})
i_esearch_ids = [i.text for i in BeautifulSoup(esearch_request.text,'lxml-xml').find_all('Id')]
i_esearch_ids += i_esearch_ids
# Step 3 - Takes the results of the Pubmed esearch results (i_esearch_ids) and submits PubMed Entrez "esummary" request
ids_concat = i_esearch_ids
search_ids = ",".join(ids_concat)
esummary_request = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",params={'db':'pubmed','id':search_ids})
# Step 4 Parses esummary results to have id, Author, Title and DOI on single line
from lxml import etree
docs = """esummary_request"""
doc = etree.XML(docs)
for ds in doc.xpath('//DocSum'):
    id = ds.xpath('.//Id/text()')[0]
    title = ds.xpath('.//Item[@Name="Title"]/text()')[0]
    al = [author for author in ds.xpath('.//Item[@Name="Author"]/text()')]
    print(id,",",title,",",al)
This gives me an error on the last step:
File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 29, in doc = etree.XML(docs)
File "src/lxml/etree.pyx", line 3236, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
There is something fundamental that I am missing (I am new to Python).
Thanks to John and Giles.
Since you are dealing with an xml file, you should use an xml parser, like lxml, coupled with xpath. In this case, it can be as simple as:
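A minimal sketch of that approach, assuming an esummary response shaped like the hypothetical sample below. The IDs, author names, titles and DOI are invented placeholders, and first_text and parse_docsums are illustrative helper names, not part of lxml:

```python
from lxml import etree

# Hypothetical, trimmed-down esummary response used in place of a live request.
# Real responses carry many more <Item> fields per <DocSum>.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>11111111</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Doe J</Item>
      <Item Name="Author" Type="String">Roe R</Item>
    </Item>
    <Item Name="Title" Type="String">An example title.</Item>
    <Item Name="DOI" Type="String">10.1000/example</Item>
  </DocSum>
  <DocSum>
    <Id>22222222</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Poe E</Item>
    </Item>
    <Item Name="Title" Type="String">A record without a DOI.</Item>
  </DocSum>
</eSummaryResult>"""


def first_text(node, path):
    """Return the first text match for an XPath, or None if there is none."""
    matches = node.xpath(path)
    return matches[0] if matches else None


def parse_docsums(xml_bytes):
    """Yield one (id, author, title, doi) row per article/author combination."""
    root = etree.XML(xml_bytes)  # expects the XML document itself, as bytes
    for ds in root.xpath("//DocSum"):
        pmid = first_text(ds, "./Id/text()")
        title = first_text(ds, './/Item[@Name="Title"]/text()')
        doi = first_text(ds, './/Item[@Name="DOI"]/text()')  # None when absent
        for author in ds.xpath('.//Item[@Name="Author"]/text()'):
            yield pmid, author, title, doi


rows = list(parse_docsums(SAMPLE_XML))
for row in rows:
    print(row)
```

With a live request, passing esummary_request.content (the response bytes) to parse_docsums would take the place of SAMPLE_XML. The triple-quoted string in the question contains only the literal text esummary_request rather than the response body, which is why the parser reports that no '<' was found at line 1, column 1.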
Output: