Issues retrieving all records from institutional OAI-PMH repository using Sickle

I have been working on retrieving all records from the OAI-PMH repositories of various research institutions using the Sickle library in Python. I have written code that performs consecutive harvesting: it iterates over the records of a repository and saves them both as an XML file and in an SQL database.

However, for some reason I am unable to retrieve all the records in the repository: records are missing, particularly for the years 2017-2020. If I perform selective harvesting by date using the "from" parameter in Sickle, I am able to retrieve some additional records, but not all of them.
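One way to narrow down where records go missing might be to split the date range into smaller windows and count the identifiers returned for each one. The sketch below assumes yearly windows; `yearly_windows` and `count_per_window` are hypothetical helper names (not part of Sickle), and the endpoint and metadata prefix are whatever the harvest already uses.

```python
from datetime import date, timedelta


def yearly_windows(start, end):
    """Split [start, end] into inclusive per-calendar-year (from, until) ISO date pairs."""
    windows = []
    cursor = start
    while cursor <= end:
        # Each window ends on 31 December or on the overall end date, whichever is earlier.
        until = min(date(cursor.year, 12, 31), end)
        windows.append((cursor.isoformat(), until.isoformat()))
        cursor = until + timedelta(days=1)
    return windows


def count_per_window(endpoint, prefix, windows):
    """Count non-deleted identifiers per window (hypothetical helper, needs network access)."""
    # Imported here so the date helper above works without Sickle installed.
    from sickle import Sickle
    from sickle.oaiexceptions import NoRecordsMatch

    sickle = Sickle(endpoint)
    counts = {}
    for date_from, date_until in windows:
        try:
            headers = sickle.ListIdentifiers(
                **{"metadataPrefix": prefix, "from": date_from, "until": date_until},
                ignore_deleted=True,
            )
            counts[(date_from, date_until)] = sum(1 for _ in headers)
        except NoRecordsMatch:
            # The repository reports an empty window as an OAI error.
            counts[(date_from, date_until)] = 0
    return counts
```

Comparing the per-window counts with what ends up in the XML file or database should show whether the gap is limited to particular years.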

I suspect the issue is that some of the records in the OAI repository are empty, and that Sickle stops harvesting when it encounters a record that contains no information.

I have set the optional parameter "ignore_deleted" to True in the code in order to skip deleted records. However, I am unsure whether it is possible to add an additional parameter that skips empty records.
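I am not aware of a built-in Sickle parameter for skipping empty records, so the skipping may have to happen around the iterator instead. The following is a minimal sketch under several assumptions: `iter_nonempty` is a hypothetical helper, an "empty" record is taken to be one without a usable `metadata` attribute, and parsing such a record is assumed to fail with `AttributeError` or `IndexError` (this may differ between Sickle versions). It also assumes a class-based iterator that can continue after a failed item; a plain generator would stop at the first error.

```python
def iter_nonempty(records, max_consecutive_errors=5):
    """Yield records that carry metadata; skip deleted, empty, or unparsable ones."""
    iterator = iter(records)
    errors = 0
    while True:
        try:
            record = next(iterator)
        except StopIteration:
            return
        except (AttributeError, IndexError):
            # Assumption: an empty record makes parsing fail with one of these.
            errors += 1
            if errors >= max_consecutive_errors:
                # Give up instead of looping forever on a persistent failure.
                raise
            continue
        errors = 0
        if getattr(record, "deleted", False):
            continue
        if not getattr(record, "metadata", None):
            # No metadata at all, or an empty metadata dictionary.
            continue
        yield record
```

It would be used as `for record in iter_nonempty(sickle.ListRecords(...)):` in place of iterating over the Sickle result directly.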

Below is an excerpt of the code that specifies the consecutive harvesting of the OAI repository.

import datetime
import uuid

from sickle import Sickle

api_list = [
    "https://pure.itu.dk/ws/oai",
]

# OAI-PMH datestamps use the form YYYY-MM-DD
date = "2020-08-01"
last_retrieval = "1950-01-01"


for api in api_list:
    institution = ""
    institution = inst_institution(api)
    record_total=0

    sickle = Sickle(api)

    harvest_id = uuid.uuid4()  # random ID for this harvest run

    recs = sickle.ListRecords(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date},ignore_deleted=True)
    headers = sickle.ListIdentifiers(**{'metadataPrefix': 'ddf-mxd', 'from': last_retrieval, 'until': date},ignore_deleted=True)
    for header in headers:
        record_total = record_total + 1
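        try:
            r = next(recs)
        except StopIteration:
            # The excerpt is cut off after the call above; stopping at the end
            # of the record stream is an assumed placeholder handler.
            break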
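Because the identifiers and the records above come from two separate requests, nothing guarantees that the two streams stay aligned. A single pass over ListRecords, reading the identifier from each record's header, may be less fragile. This is only a sketch: SQLite, the `records` table layout, and the names `store_records` and `harvest` are assumptions for illustration, not the setup described in the question.

```python
import sqlite3


def store_records(conn, records):
    """Insert (identifier, raw XML) pairs from one pass over `records`; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (identifier TEXT PRIMARY KEY, xml TEXT)"
    )
    stored = 0
    for record in records:
        # A repeated identifier overwrites the earlier row.
        conn.execute(
            "INSERT OR REPLACE INTO records (identifier, xml) VALUES (?, ?)",
            (record.header.identifier, record.raw),
        )
        stored += 1
    conn.commit()
    return stored


def harvest(endpoint, prefix, date_from, date_until, db_path="harvest.db"):
    """Harvest one endpoint in a single ListRecords pass (needs network access)."""
    from sickle import Sickle  # third-party; imported here so store_records works alone

    sickle = Sickle(endpoint)
    records = sickle.ListRecords(
        **{"metadataPrefix": prefix, "from": date_from, "until": date_until},
        ignore_deleted=True,
    )
    conn = sqlite3.connect(db_path)
    try:
        return store_records(conn, records)
    finally:
        conn.close()
```

The returned count can then be compared with the number of identifiers the repository reports for the same date range.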