I have set up Apache Nutch 1.18 on a Hadoop cluster and seeded it with around 10k URLs. After some time I ran the domainstats command to get statistics per domain, and it turns out that Nutch crawls some websites very heavily but fetches only a few pages of many others. Have a look at the image below.
I am using mostly the default configuration; only `generate.max` is set to 500. Where is the problem?
- How can I configure Nutch to treat all domains equally when selecting URLs?
- How can I configure Nutch to focus on the websites that have been crawled less?
- Also, out of the 10k seeds, Nutch has given me stats for only about 3k. How can I get stats for all seed URLs (even if they are not found)?
During fetch list generation Nutch groups URLs by host name (the default for `generate.count.mode`; grouping by registered domain or by IP is also possible). Both the total size of the fetch list and the size of the per-host/domain/IP fetch list are configurable.

If the requirement is to include URLs from all hosts in a generate-fetch-update cycle, the total size of the fetch list (`--size-fetchlist` for bin/crawl or `-topN` for bin/nutch) should be a multiple of the number of unique host names. E.g., with 10k hosts/sites a reasonable fetch list size could be 200k. To ensure that all hosts/sites are included, set the maximum size of each per-host fetch list (property `generate.max.count`) to the value of the multiplier, here 20; see the sketch below.

Note that the fetch list size shouldn't be too small, because there is a certain overhead in running a fetch cycle (DNS lookups, robots.txt fetching and parsing, and the resources spent on the generate and update steps).
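For reference, a minimal sketch of the two properties in nutch-site.xml, assuming roughly 10k unique hosts and a target fetch list size of 200k (the values are only examples):

```xml
<!-- nutch-site.xml: group the fetch list by host and cap each host at 20 URLs -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Group the fetch list by host (alternatives: domain, ip).</description>
</property>
<property>
  <name>generate.max.count</name>
  <value>20</value>
  <description>At most 20 URLs per host in each generated fetch list.</description>
</property>
```

The total fetch list size is then passed on the command line, e.g. `--size-fetchlist 200000` for bin/crawl or `-topN 200000` for bin/nutch generate.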
Regarding focusing on the less crawled sites: there is no out-of-the-box solution, but it could be implemented by a scoring filter; a rough sketch follows.
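A rough, untested sketch of such a filter, assuming the `AbstractScoringFilter` base class shipped with Nutch 1.x and a hypothetical property `scoring.hostbalance.stats.file` pointing to a local file with `host<TAB>fetched-page-count` lines (which could be derived from domainstats output). The package, class, and property names are made up; the filter would still need a plugin.xml descriptor, must be listed in `plugin.includes`, and the stats file has to be available on all nodes running the generate job:

```java
package org.example.scoring;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

/** Sketch: prefer URLs of hosts that have few fetched pages so far. */
public class HostBalanceScoringFilter extends AbstractScoringFilter {

  /** Pages fetched per host, loaded from a simple "host <tab> count" file. */
  private Map<String, Long> fetchedPerHost = new HashMap<>();

  @Override
  public void setConf(Configuration conf) {
    super.setConf(conf);
    // hypothetical property; the file could be derived from domainstats output
    String statsFile = conf.get("scoring.hostbalance.stats.file");
    if (statsFile == null) {
      return; // no stats configured: the filter degrades to a no-op
    }
    try (BufferedReader reader = new BufferedReader(new FileReader(statsFile))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t");
        if (parts.length == 2) {
          fetchedPerHost.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
        }
      }
    } catch (IOException | NumberFormatException e) {
      // on any error keep the map empty and fall back to the default sorting
    }
  }

  /** The Generator uses this value to select the topN URLs per cycle. */
  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    try {
      String host = new URL(url.toString()).getHost();
      long fetched = fetchedPerHost.getOrDefault(host, 0L);
      // the more pages a host already has, the lower its URLs are ranked
      return initSort / (1.0f + fetched);
    } catch (MalformedURLException e) {
      return initSort;
    }
  }
}
```

The heuristic (dividing by the number of already fetched pages) is just one possible choice; the important part is that `generatorSortValue` decides which URLs make it into the next fetch list.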
Finally, the fetch list size of the first cycle, which fetches the seeds, should be at least the size of the seed list; otherwise part of the seeds is never fetched and consequently does not show up in the stats.
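For example, with 10k seeds the first cycle could be run along these lines (paths are placeholders, and the fetch list size could equally be set via `--size-fetchlist` when using bin/crawl):

```sh
# inject the ~10k seed URLs (paths are placeholders)
bin/nutch inject crawl/crawldb urls/

# first generate: the fetch list must cover the whole seed list
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
```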