I'm trying to get bulk data from the Europe PMC Annotations API in Python


My code is:

import json
import requests

# json_to_dataframe is my own helper function (not shown here)
if __name__ == '__main__':
    json_data = requests.get("https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC%3A4771370&section=Abstract&provider=Europe%20PMC&format=JSON").content
    r = json.loads(json_data)
    df = json_to_dataframe(r)
    print(df)

My only problem is how to run this for multiple IDs: I have thousands of IDs in a file. Please help. I'm using Python.


There are 2 best solutions below

4
On BEST ANSWER

Assuming you know Python and can get all the IDs from the file into a list article_ids, you can use the following script:

import json
import requests

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

article_ids = ['PMC:4771370']

for article_id in article_ids:
    params = {
        'articleIds': article_id,
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    json_data = requests.get(URL, params=params).content
    r = json.loads(json_data)
    df = json_to_dataframe(r)
    print(df)
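
If the IDs are not in a list yet, a minimal sketch for loading them could look like the following. It assumes a plain-text file with one ID per line in the SOURCE:EXTERNAL_ID form (for example PMC:4771370); the helper name load_article_ids is made up for this illustration and is not part of any library.

```python
from pathlib import Path


def load_article_ids(path):
    """Return the non-empty, stripped lines of a text file as a list of IDs.

    Assumes one ID per line; blank lines and surrounding whitespace are ignored.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

With a file named, say, article_ids.txt (a hypothetical name), calling load_article_ids("article_ids.txt") would give the article_ids list used in the loop above.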
1
On

After analyzing the shared URL and reading an article on URL encoding, I observed that each value passed in the articleIds parameter of annotationsByArticleIds has the format SOURCE:EXTERNAL_ID.

TEST 1: If you request the URL:

https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC

Output is: It must contain values with format SOURCE:EXTERNAL_ID where SOURCE must have one of the following values [PMC, MED, PAT, AGR, CBA, HIR, CTX, ETH, CIT, PPR, NBK] and EXTERNAL_ID must be a number when SOURCE=PMC

  • The output above lists the possible sources
  • Each source is separated from its EXTERNAL_ID by a colon
  • A colon is represented by %3A in URL encoding
  • To separate one SOURCE:EXTERNAL_ID pair from the next, you can use a comma
  • A comma is represented by %2C in URL encoding
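
The encoding described in the points above can be checked with Python's standard library. This is only an illustrative sketch; the second article ID is invented for the example.

```python
from urllib.parse import quote, urlencode

# Two SOURCE:EXTERNAL_ID pairs joined by a comma (the second ID is hypothetical).
ids = "PMC:4771370,PMC:4771371"

# Percent-encode everything that is not an unreserved character.
encoded = quote(ids, safe="")
print(encoded)  # PMC%3A4771370%2CPMC%3A4771371

# urlencode builds a whole query string; it writes spaces as '+'.
query = urlencode({"articleIds": ids, "provider": "Europe PMC"})
print(query)  # articleIds=PMC%3A4771370%2CPMC%3A4771371&provider=Europe+PMC
```

When the parameters are passed to requests.get as a dict via params=..., requests performs this encoding itself, so the raw colon and comma can be used in the Python string.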

ANSWER: To fetch multiple articles, you can build a string of article IDs in the format SOURCE1:EXTERNAL_ID1,SOURCE2:EXTERNAL_ID2,...,SOURCEn:EXTERNAL_IDn and pass it as the articleIds value in the main URL.

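A few limitations:

  • The maximum URL length is commonly around 2048 characters
  • Depending on the length of the IDs, that leaves room for roughly 150 to 200 articles per request; the API may also enforce its own, lower cap on IDs per request, so check its documentation
  • You could loop over batches of about 150 IDs and fetch the required information for each batch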
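
The batching idea can be sketched as follows. The helper names batches and batch_params are made up for this example, and the default batch size of 150 is only the estimate given above, not a limit confirmed by the API; lower it if the server rejects the request.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_params(article_ids, batch_size=150):
    """Build one query-parameter dict per batch of article IDs."""
    return [
        {
            "articleIds": ",".join(batch),  # SOURCE:ID pairs separated by commas
            "section": "Abstract",
            "provider": "Europe PMC",
            "format": "JSON",
        }
        for batch in batches(article_ids, batch_size)
    ]


# Hypothetical IDs, only to show the shape of the result.
example_ids = [f"PMC:{n}" for n in range(1, 6)]
for params in batch_params(example_ids, batch_size=2):
    print(params["articleIds"])
```

Each dict can then be passed to requests.get(URL, params=params), in the same way as in the accepted answer, so that one request covers a whole batch instead of a single article.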