Is Heritrix Crawl Deterministic?

Asked by TechyHarry on 03 February 2016 at 07:43

Let's say there is a website abc.com and we crawl abc.com for 100 pages as below.

Day 1: create a crawl job in Heritrix, specifying maxDocumentsToDownload as 100.

Day 2: clone the above job in Heritrix and run it.

If the website doesn't change over those two days, will I get the same 100 pages or a different set of 100 pages?

If any more information is required, please let me know.

Thanks, Hareesh


There is 1 answer below

Girish Mane On 03 February 2016 at 13:30

After you clone the job on the second day, it will basically download the same set of pages unless the website (its web pages) has been updated. On the other hand, while running a job Heritrix tries its best not to crawl the same page twice. Because abc.com and abc.com/index might point to same webp
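To make the idea concrete, here is a minimal toy sketch in Python. It is not Heritrix code and does not use any Heritrix API. The link graph `SITE`, the `canonicalize` rule (treating a trailing `/index` as an alias) and the `max_documents` parameter are all assumptions made up for illustration. It only shows that a single-threaded, first-in-first-out crawl with a "seen" set returns the same pages on both days when the site is unchanged. A real crawler is multi-threaded and depends on network timing, so the exact set near the limit may differ in practice.

```python
from collections import deque

# Hypothetical, unchanging link graph standing in for abc.com.
SITE = {
    "abc.com": ["abc.com/index", "abc.com/a", "abc.com/b"],
    "abc.com/a": ["abc.com/b", "abc.com/c"],
    "abc.com/b": ["abc.com", "abc.com/d"],
    "abc.com/c": [],
    "abc.com/d": ["abc.com/a"],
}


def canonicalize(url):
    """Map aliases of one page to a single key (assumed rule: drop a trailing '/index')."""
    url = url.rstrip("/")
    if url.endswith("/index"):
        url = url[: -len("/index")]
    return url


def crawl(seed, max_documents):
    """Breadth-first crawl that fetches each canonical URL at most once."""
    start = canonicalize(seed)
    frontier = deque([start])   # FIFO queue of URLs waiting to be fetched
    seen = {start}              # canonical URLs already queued or fetched
    downloaded = []
    while frontier and len(downloaded) < max_documents:
        url = frontier.popleft()
        downloaded.append(url)
        for link in SITE.get(url, []):
            key = canonicalize(link)
            if key not in seen:  # skip pages already scheduled, including aliases
                seen.add(key)
                frontier.append(key)
    return downloaded


day1 = crawl("abc.com", max_documents=4)
day2 = crawl("abc.com", max_documents=4)
print(day1 == day2)
```

In this model abc.com/index is never fetched separately, because it canonicalizes to abc.com, which is already in the seen set.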
