How do we know when Heritrix completes a crawl job?


In our application, Heritrix is used as the crawl engine. Once a crawl job finishes, we manually call an endpoint to download the PDFs from a website. We would like to automate this PDF download so that it starts as soon as the crawl job is complete. Does Heritrix provide a URI or web service method that returns the status of a job, or do we need to write a polling app that continuously monitors the job status?

There is 1 solution below


I don't know whether there is a way to do this without continuous monitoring, but you can use the Heritrix API to get the status of a job, with something like:

curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

This returns XML from which you can read the job status.
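If you go the polling route, the curl call above could be wrapped in a small script along these lines. This is only a sketch under several assumptions that are not confirmed in this thread: that the status lives in an element named crawlControllerState, that its value becomes FINISHED when the crawl ends, and that the admin interface uses HTTP Digest authentication. The function names are made up for illustration, so check the tag and value against the XML your own instance returns.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET

JOB_URL = "https://localhost:8443/engine/job/myjob"  # same URL as the curl example


def extract_state(xml_text, tag="crawlControllerState"):
    """Return the text of the first <tag> element, or None if it is missing.

    The tag name is an assumption; adjust it to match your XML output.
    """
    root = ET.fromstring(xml_text)
    node = root if root.tag == tag else root.find(f".//{tag}")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def fetch_job_xml(url=JOB_URL, user="admin", password="admin"):
    """Mirror the curl call: POST an empty action and ask for XML."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # equivalent of curl -k (self-signed cert)
    ctx.verify_mode = ssl.CERT_NONE
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPDigestAuthHandler(passwords),  # assumed auth scheme
    )
    request = urllib.request.Request(
        url, data=b"action=", headers={"Accept": "application/xml"}
    )
    with opener.open(request, timeout=30) as response:
        return response.read().decode("utf-8")


def wait_until_finished(fetch=fetch_job_xml, interval=30.0,
                        done_state="FINISHED", sleep=time.sleep):
    """Poll until the reported state equals done_state (an assumed value)."""
    while True:
        state = extract_state(fetch())
        if state == done_state:
            return state
        sleep(interval)
```

Calling wait_until_finished() blocks until the job reports the finished state, after which you can trigger the PDF download endpoint. In a real deployment you would probably also want a timeout and handling for network errors.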

Another, perhaps easier (though less 'professional') option is to check whether your job's warcs directory contains a file with the .open extension. If it does not, the job is finished.
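The file-based check could look roughly like the sketch below. The function names are hypothetical, and the extra requirement that the directory already contains at least one file is my own guard (so that a job that has not started writing yet is not mistaken for a finished one), not something Heritrix prescribes.

```python
from pathlib import Path


def has_open_warcs(warcs_dir):
    """True if any file in warcs_dir still carries the .open extension."""
    return any(p.is_file() for p in Path(warcs_dir).glob("*.open"))


def crawl_looks_finished(warcs_dir):
    """Heuristic: some files were written and none of them is still .open."""
    directory = Path(warcs_dir)
    if not directory.is_dir():
        return False
    wrote_something = any(p.is_file() for p in directory.iterdir())
    return wrote_something and not has_open_warcs(directory)
```

You would call crawl_looks_finished() with the path to your job's warcs directory from a scheduled task or a simple loop, and start the PDF download once it returns True.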