I have an Excel list of DOIs of papers I'm interested in. Based on this list, I would like to download all the papers.
I tried to do it with requests, as recommended in their documentation, but the PDF files I get are damaged: they are only a few KB in size. I changed the chunk_size several times, from None up to 1024*1024, and I have read many posts already. Nothing helps.
Please, what are your ideas?
import pandas as pd
import os
import requests

def get_pdf(doi, file_to_save_to):
    url = 'http://api.elsevier.com/content/article/doi:'+doi+'?view=FULL'
    headers = {
        'X-ELS-APIKEY': "keykeykeykeykeykey",
        'Accept': 'application/pdf'
    }
    r = requests.get(url, stream=True, headers=headers)
    if r.status_code == 200:
        for chunk in r.iter_content(chunk_size=1024*1024):
            file_to_save_to.write(chunk)
            return True

doi_list = pd.read_excel('list.xls')
doi_list.columns = ['DOIs']
count = 0
for doi in doi_list['DOIs']:
    doi = doi.replace('DOI:','')
    pdf = doi.replace('/','%')
    if not os.path.exists(f'path/{pdf}.pdf'):
        file = open(f'path/{pdf}.pdf', 'wb')
        get_pdf(doi, file)
        count += 1
        print(f"Dowloaded: {count} of {len(doi_list['DOIs'])} articles")
I think your problem is the `return True` in `for chunk in r.iter_content`. With that line, you'll only ever write one chunk of the PDF, of size `chunk_size`. You should also open files using `with`; as is, you'll never close the file handles.
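Here is a minimal sketch of both fixes together, reusing the URL and headers from the question. The helper name `write_chunks`, the `api_key` parameter, and the 30-second timeout are my own additions for illustration, not part of the original code. The function returns only after the loop has consumed every chunk, and the `with` blocks close the file and the response.

```python
import requests


def write_chunks(chunks, path):
    """Write every chunk from an iterable of bytes to path; return bytes written."""
    total = 0
    # 'with' closes the file handle even if a write fails.
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    # Returning here, after the loop, is the key difference from the question.
    return total


def get_pdf(doi, path, api_key):
    """Download one article to path; return True on HTTP 200, else False."""
    url = 'http://api.elsevier.com/content/article/doi:' + doi + '?view=FULL'
    headers = {
        'X-ELS-APIKEY': api_key,  # hypothetical parameter; pass your own key
        'Accept': 'application/pdf',
    }
    # The response is also a context manager, so the connection is released.
    with requests.get(url, stream=True, headers=headers, timeout=30) as r:
        if r.status_code != 200:
            return False
        write_chunks(r.iter_content(chunk_size=1024 * 1024), path)
    return True
```

In the loop from the question you would then call `get_pdf(doi, f'path/{pdf}.pdf', api_key)` and increment `count` only when it returns `True`. Because the file is opened only after the status check, a failed request no longer leaves an empty `.pdf` behind that the `os.path.exists` test would skip on the next run.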