Let's say there is a website, abc.com, and we crawl it for 100 pages as follows.
Day 1: Create a crawl job in Heritrix, specifying maxDocumentsToDownload as 100.
Day 2: Clone the above job in Heritrix and run it.
If the website doesn't change over those two days, will I get the same 100 pages or a different set of 100 pages?
If any more information is required, please let me know.
Thanks, Hareesh
After cloning the job on the second day, it will basically download the same set of pages unless the website (its webpages) has been updated. On the other hand, while running a job, Heritrix tries its best not to crawl the same page twice. Because abc.com and abc.com/index might point to the same webp