Bulk data from the Europe PMC Annotations API


I have a pmc.txt file that contains at least 20,000 PMC IDs, and I think the API will only take about 1,000 requests each time. I have written the code for one ID, but I'm not able to do it for the whole file. My main code is below. Please help.

import json

import requests

# json_to_dataframe is my own helper (defined elsewhere) that flattens the JSON response

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    article_ids = ['PMC:4771370']

    for article_id in article_ids:
        params = {
            'articleIds': article_id,
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv("data.csv")

There are 2 answers below

BEST ANSWER

You can read in the data from the file like so:

with open('pmc.txt', 'r') as file:
    article_ids = [item.replace('\n', '') for item in file]

This replaces article_ids = ['PMC:4771370'] in your code.

You are also going to have to save your output files under different names (you will end up with 20,000 files that way; alternatively, append the JSON data to a single dataframe before writing it to CSV, as sketched below).
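For the single-CSV route, here is a minimal sketch, assuming json_to_dataframe returns a pandas DataFrame; fetch_annotations is a hypothetical helper name wrapping the GET request from the question:

import json

import pandas as pd
import requests

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

def fetch_annotations(article_id):
    # hypothetical helper: one GET per article id, same params as in the question
    params = {
        'articleIds': article_id,
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    response = requests.get(URL, params=params)
    return json.loads(response.content)

with open('pmc.txt', 'r') as file:
    article_ids = [item.replace('\n', '') for item in file]

# collect one dataframe per id, then write a single CSV at the end
frames = [json_to_dataframe(fetch_annotations(article_id)) for article_id in article_ids]
pd.concat(frames, ignore_index=True).to_csv('data.csv')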

Therefore, this is something you could do to split the IDs into chunks and send each chunk in a single articleIds parameter:

import json

import requests

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    with open('pmc.txt', 'r') as file:
        article_ids = [item.replace('\n', '') for item in file]

    # the API only allows 1-8 ids per request
    chunks = [article_ids[x:x + 8] for x in range(0, len(article_ids), 8)]

    for count, chunk in enumerate(chunks):
        params = {
            'articleIds': ','.join(chunk),
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv(f"data{count}.csv")
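With 20,000 IDs this loop still makes roughly 2,500 requests, so failures become likely at some point. Here is a minimal sketch of a fetch_chunk helper (a hypothetical name, not part of the API) that reuses one requests.Session and retries briefly on failure; you could call r = fetch_chunk(URL, params) in place of the two requests/json lines above and keep the rest of the loop unchanged:

import time

import requests

session = requests.Session()  # reuse one connection across the ~2,500 requests

def fetch_chunk(url, params, retries=3, delay=1.0):
    # hypothetical helper: retry a failed GET a few times with a short pause
    for attempt in range(retries):
        try:
            response = session.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff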
        
SECOND ANSWER

You can use grequests. Try setting stream=False in grequests.get, or explicitly call response.close() after reading response.content. It's discussed in detail here.
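A minimal sketch of the grequests approach, assuming chunks and the request parameters are built as in the accepted answer (build_params is a hypothetical helper name; the size argument caps how many requests are in flight at once):

import grequests  # import before requests-based code: it monkey-patches via gevent

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'
chunks = [['PMC:4771370']]  # chunks of up to 8 ids, built as in the accepted answer

def build_params(chunk):
    # hypothetical helper: same parameters as in the accepted answer
    return {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }

requests_iter = (grequests.get(URL, params=build_params(chunk), stream=False)
                 for chunk in chunks)
for response in grequests.map(requests_iter, size=10):  # at most 10 concurrent
    if response is not None:
        data = response.json()
        response.close()  # release the connection explicitly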

Additionally, you can also test requests-futures. grequests is faster, but it brings monkey patching and additional problems with dependencies. requests-futures is several times slower than grequests, but plain requests wrapped in a ThreadPoolExecutor can be as fast as grequests without external dependencies. Reference here.
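A minimal sketch of the ThreadPoolExecutor route (standard library only; fetch_chunk is a hypothetical helper name, and chunks follows the accepted answer's setup):

from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'
chunks = [['PMC:4771370']]  # chunks of up to 8 ids, built as in the accepted answer

def fetch_chunk(chunk):
    # hypothetical helper: one GET per chunk, same parameters as the accepted answer
    params = {
        'articleIds': ','.join(chunk),
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    return requests.get(URL, params=params).json()

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_chunk, chunks))  # preserves chunk order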