I have been working on retrieving all records from the OAI-PMH repositories of various research institutions using the Sickle library in Python. I have written code that performs consecutive harvesting: it iterates over the records of a repository and saves them both as an XML file and in an SQL database.
However, for some reason I am unable to retrieve all the records in the repository: records are missing, particularly for the years 2017-2020. If I perform selective harvesting by date using the "from" parameter in Sickle, I am able to retrieve some additional records, but not all of them.
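To narrow down where records go missing, the selective harvesting described above could be run over one calendar year at a time. The following is a minimal sketch, not a tested harvester: `yearly_windows` is a hypothetical helper name, the date range is only an example, and the Sickle call is shown as a comment because it needs network access.

```python
import datetime


def yearly_windows(start, end):
    """Yield (from, until) ISO date strings covering start..end, one calendar year at a time."""
    current = start
    while current <= end:
        year_end = datetime.date(current.year, 12, 31)
        window_end = min(year_end, end)
        yield current.isoformat(), window_end.isoformat()
        # The next window starts the day after this one ends.
        current = window_end + datetime.timedelta(days=1)


# Example range (an assumption): the period where records appear to be missing.
windows = list(yearly_windows(datetime.date(2017, 1, 1), datetime.date(2020, 8, 1)))

for w_from, w_until in windows:
    # Each pair would be passed on as the 'from' and 'until' arguments, e.g.:
    # sickle.ListRecords(**{'metadataPrefix': 'ddf-mxd', 'from': w_from, 'until': w_until},
    #                    ignore_deleted=True)
    print(w_from, w_until)
```

Counting the records returned per window would show which years come up short. Note that under OAI-PMH a window containing no records is reported as a `noRecordsMatch` error rather than as an empty list, so each request would need its own error handling.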
I suspect the issue is that some of the records in the OAI repository are empty, and that Sickle stops harvesting when it encounters a record that contains no information.
I have set the optional parameter "ignore_deleted" to True in the code in order to skip deleted records. However, I am unsure whether it's possible to add a parameter that also skips empty records.
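As a minimal sketch of what such skipping could look like outside Sickle itself: the wrapper below advances an iterator and skips any item whose retrieval raises an exception. It rests on two unverified assumptions: that an empty record surfaces as an exception raised while iterating, and that the iterator can still be advanced afterwards. `iter_skipping_failures` and `FakeRecords` are hypothetical names used only for illustration; `FakeRecords` stands in for the real record iterator so the example runs without network access.

```python
def iter_skipping_failures(iterator, errors=None, max_consecutive=50):
    """Yield items from iterator, skipping items whose retrieval raises.

    Assumes the iterator stays usable after raising, which holds for
    class-based iterators but not for plain generators.
    """
    failures = 0
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return
        except Exception as exc:  # e.g. a record that cannot be parsed
            failures += 1
            if errors is not None:
                errors.append(exc)
            if failures >= max_consecutive:
                raise  # give up instead of looping forever on a persistent error
            continue
        failures = 0
        yield item


class FakeRecords:
    """Hypothetical stand-in for a record iterator; None marks an 'empty' record."""

    def __init__(self, items):
        self._items = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._items)  # StopIteration propagates at the end
        if item is None:
            raise AttributeError("record has no metadata")
        return item


errors = []
harvested = list(iter_skipping_failures(FakeRecords(["a", None, "b", None, "c"]), errors))
print(harvested, len(errors))
```

Collecting the exceptions in `errors` instead of discarding them makes it possible to check afterwards how many records were skipped and why.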
Below is an excerpt of the code that performs the consecutive harvesting of the OAI repository:
import datetime
import uuid  # needed for uuid.uuid4() below
from sickle import Sickle
api_list = [
    "https://pure.itu.dk/ws/oai",
]
# OAI-PMH datestamps use the form YYYY-MM-DD
date = "2020-08-01"
last_retrieval = "1950-01-01"
for api in api_list:
    institution = ""
    institution = inst_institution(api)
    record_total = 0
    sickle = Sickle(api)
    harvest_id = uuid.uuid4()  # generating a random ID for the record.
    recs = sickle.ListRecords(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date}, ignore_deleted=True)
    headers = sickle.ListIdentifiers(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date}, ignore_deleted=True)
    for header in headers:
        record_total = record_total + 1
        try:
            r = recs.next()