Harvesters using the DCAT extension get stuck


We've been using ckanext-dcat to harvest from remote JSON sources. Sometimes a harvest job doesn't finish and has to be deleted along with all the datasets from that source, which is not very convenient, but afterwards everything goes back to normal. I don't know if there is a way to delete just a single job.

But now I get this in the gather consumer log:

    Traceback (most recent call last):
      File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
        load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
        invoke(command, command_name, options, args[1:])
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
        exit_code = runner.run(args)
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
        result = self.command()
      File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 129, in command
        gather_callback(consumer, method, header, body)
      File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 219, in gather_callback
        harvest_object_ids = harvester.gather_stage(job)
      File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 186, in gather_stage
        content = self._get_content(url, harvest_job, page)
      File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 66, in _get_content
        cl = r.headers['content-length']
      File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/requests/structures.py", line 54, in __getitem__
        return self._store[key.lower()][1]
    KeyError: 'content-length'

The job finishes but no datasets get created. If I delete the job and re-harvest, it keeps running but never ends, and other harvest jobs don't update either.

How can I fix this?

BEST ANSWER

@Urkonn, there are a few different things going on here:

  • Harvester getting stuck: this might be caused by a bug in the harvester implementation, triggered by a specific format or field in the files that you are harvesting. It's hard to debug without knowing more; could you PM me a link to one of the files that causes the harvester to hang, or tell me what the logs say when this happens?

  • Clearing a source without deleting the datasets: I totally understand that removing all datasets seems overkill, but if we clear the jobs, objects, etc. from a source, then the existing datasets will lose the link to the source, which for instance means they are no longer listed on the source page. Also, new jobs won't have any way of knowing that a dataset has already been harvested for this source, so they will create duplicates even if the dataset already exists. Maybe there is a way to prevent this, but I'd say that recreating the datasets is safer.

  • KeyError: 'content-length': this is caused by upgrading to requests 2.3. I've pushed a fix to ckanext-dcat to prevent this [1], so please pull the latest version to get the patch and restart all harvest processes (a sketch of the kind of defensive check involved is shown after the reference link below).

[1] https://github.com/ckan/ckanext-dcat/commit/ed186623d83cf3baf9dd29bdb13be7f1431b8ab8
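
For illustration only, here is a minimal sketch of the kind of defensive header check that avoids this KeyError. It is not a copy of the actual patch, and the function name and size limit are made up for the example. The point is that under requests 2.3 the 'content-length' key can simply be absent from r.headers, so the safe pattern is headers.get() rather than indexing:

    import requests

    MAX_FILE_SIZE = 50 * 1024 * 1024  # hypothetical 50 MB limit, for illustration


    def fetch_remote_content(url):
        """Fetch a remote document, refusing files that report a huge size."""
        r = requests.get(url, stream=True)
        # requests 2.3 may not include 'content-length' at all (e.g. for
        # chunked or compressed responses), so r.headers['content-length']
        # raises KeyError; .get() returns None instead.
        cl = r.headers.get('content-length')
        if cl is not None and int(cl) > MAX_FILE_SIZE:
            raise RuntimeError('Remote file too big (%s bytes): %s' % (cl, url))
        return r.text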