How to write a cron job for Heritrix3 web crawling?

177 Views Asked by 莫绮静 At 17 May 2017 at 08:34

I built a job to crawl web data with Heritrix 3.0. To run it, I have to start Heritrix.java as a Java application, which brings up the server. Then I have to open a browser at https://localhost:8443 to build the job, launch it, and unpause it. How can I set up a cron job so the crawl runs automatically? Please use Java.


There is 1 best solution below

Du-Lacoste On 06 May 2023 at 03:14

I automated this for my final-year project (FYP). You can use Java, but the Heritrix documentation describes its REST API in terms of curl calls, so the simplest and quickest approach is a shell script that invokes curl for each step.

Get Current Status of Engine:

curl -v -k -u admin:admin --anyauth --location \
  -H "Accept: application/xml" https://localhost:8443/engine
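Because a cron job cannot watch the browser, it helps to confirm the engine is answering before sending any job actions. The following is a minimal sketch, not part of Heritrix itself: `wait_for_engine`, `ENGINE_URL`, and `HERITRIX_AUTH` are hypothetical names, and the URL and `admin:admin` credentials are simply the ones used in this answer.

```shell
#!/bin/sh
# Hypothetical helper: poll the engine status URL until it answers or we give up.
# ENGINE_URL and HERITRIX_AUTH default to the values used in this answer.
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"

wait_for_engine() {
    tries="${1:-30}"   # maximum number of attempts
    delay="${2:-2}"    # seconds to sleep between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, -k: accept the self-signed certificate
        if curl -s -f -k -u "$HERITRIX_AUTH" --anyauth --location \
                -H "Accept: application/xml" -o /dev/null "$ENGINE_URL"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A script could call `wait_for_engine 30 2 || exit 1` right after starting Heritrix in the background, so later steps run only once the server is reachable.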

Create new job for crawling in the Engine:

curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
  -H "Accept: application/xml" https://localhost:8443/engine

Build the Job:

curl -v -d "action=build" -k -u admin:admin --anyauth --location \
  -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Launch the Job:

curl -v -d "action=launch" -k -u admin:admin --anyauth --location \
  -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
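To run these steps from cron, they can be chained in one script. The sketch below is one possible arrangement under stated assumptions: `heritrix_post`, `run_crawl`, `DRY_RUN`, `LAUNCH_WAIT`, and `JOB_NAME` are hypothetical names introduced here. It sends `action=launch` to the job URL and then `action=unpause`, since the question mentions having to unpause the job by hand. The pause before unpausing is a guess and may need tuning.

```shell
#!/bin/sh
# Sketch of a cron-friendly wrapper around the curl calls in this answer.
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"
JOB_NAME="${JOB_NAME:-myjob}"

# POST a form body to a URL with the same flags as the manual commands.
# With DRY_RUN=1 (a hypothetical switch) the request is printed, not sent.
heritrix_post() {
    url="$1"
    data="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "POST $url $data"
        return 0
    fi
    curl -s -f -d "$data" -k -u "$HERITRIX_AUTH" --anyauth --location \
        -H "Accept: application/xml" -o /dev/null "$url"
}

run_crawl() {
    job_url="$ENGINE_URL/job/$JOB_NAME"
    # Creating may fail when the job directory already exists; keep going in that case.
    heritrix_post "$ENGINE_URL" "createpath=$JOB_NAME&action=create" \
        || echo "create failed (job may already exist)" >&2
    heritrix_post "$job_url" "action=build" || return 1
    heritrix_post "$job_url" "action=launch" || return 1
    # Assumption: the job needs a moment to reach its paused state after launch.
    [ "${DRY_RUN:-0}" = "1" ] || sleep "${LAUNCH_WAIT:-5}"
    heritrix_post "$job_url" "action=unpause" || return 1
}
```

Adding a final line `run_crawl` to the script makes it executable from cron. A crontab entry such as `0 2 * * * /path/to/heritrix-crawl.sh >> /tmp/heritrix-crawl.log 2>&1` (both paths are placeholders) would then start the crawl every night at 02:00, provided the Heritrix engine is already running. Running the script once with `DRY_RUN=1` prints the four requests without contacting the server.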
