Is Heritrix Crawl Deterministic?

Let's say there is a website abc.com and we crawl abc.com for 100 pages as below.

Day 1: Create a crawl job in Heritrix, specifying maxDocumentsToDownload as 100.

Day 2: Clone the above job in Heritrix and run it.

If the website doesn't change over those two days, will I get the same 100 pages or a different set of 100 pages?

Please let me know if any more information is required.

Thanks, Hareesh

There is 1 best solution below

After you clone the job on the second day, it will basically download the same set of pages, unless the website (its webpages) has been updated. On the other hand, while running a job, Heritrix tries its best not to crawl the same page twice. Because abc.com and abc.com/index might point to the same webp