Nutch http.redirect.max: what does it mean?

I am crawling, for example, 1000 websites. When I run readdb, some URLs show up as db_redir_temp and db_redir_perm. If I set http.redirect.max=10, does this value apply to each website, or does it allow only 10 redirects across the entire crawl?

1 answer

http.redirect.max is defined as:

The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, fetcher won't immediately follow redirected URLs, instead it will record them for later fetching.

The number applies to the redirect chain of a single web page, not to a whole website and not to the crawl as a whole. 10 is a really generous limit; 3 should be enough in most cases, given that a redirect target that is not followed immediately will be tried in one of the later fetch cycles anyway. Note that the redirect source is always recorded in the CrawlDb as db_redir_perm or db_redir_temp.
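
As a minimal sketch, the limit could be set in conf/nutch-site.xml, which overrides the defaults shipped in nutch-default.xml. The value 3 is only the suggestion from above, not a required setting; adjust it to your crawl.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Follow at most 3 redirects per fetched page within one fetch cycle.
       The value 3 is an illustrative choice, not a Nutch default. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>
</configuration>
```

With a value of 0 or a negative number, the fetcher would not follow redirects immediately and would instead record the targets for a later fetch cycle, as the property description states.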