Apache Nutch crawls a few domains heavily and others very little with the default configuration

I have set up Apache Nutch 1.18 on a Hadoop cluster and given it a seed list of around 10k URLs. After some time, I ran the domainstats command to get the statistics for each domain. It turns out that Nutch is crawling some websites very thoroughly but only a few pages of many others. Have a look at the image below: [image]

I am using mostly the default configuration; only generate.max is set to 500. Where is the problem?

  1. How can I configure Nutch to treat all domains equally when selecting URLs?
  2. How can I configure Nutch to focus on the websites that have been crawled less?
  3. Also, out of 10k seeds, Nutch has given me stats for only around 3k. How can I get stats for all seed URLs (even if they were not found)?

There is 1 solution below

During fetch list generation, Nutch groups URLs by host name. This is the default for generate.count.mode; grouping could also be by registered domain or IP. Both the total size of the fetch list and the size of the fetch list per host/domain/IP are configurable.

If it's a requirement to include URLs from all hosts in every generate-fetch-update cycle, the total size of the fetch list (--size-fetchlist for bin/crawl or -topN for bin/nutch) should be a multiple of the number of unique host names. E.g., with 10k hosts/sites a reasonable fetch list size could be 200k (10k hosts × 20 URLs per host). To ensure that all hosts/sites are included, set the maximum size of each per-host fetch list (property generate.max.count) to the value of the multiplier, here 20.
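As a rough sketch of the settings described above, the per-host cap could be overridden in conf/nutch-site.xml. The values assume the 10k-host example (multiplier 20); they are illustrative rather than recommended defaults, and the total fetch list size of 200k would still have to be passed separately via --size-fetchlist or -topN.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Count URLs per host name when applying generate.max.count.
       "host" is already the default; it is spelled out here for clarity. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

  <!-- Assumed value: at most 20 URLs per host in one fetch list,
       so that 10k hosts x 20 URLs fit into a 200k fetch list. -->
  <property>
    <name>generate.max.count</name>
    <value>20</value>
  </property>
</configuration>
```

With this cap in place, no single host can fill more than 20 slots of a fetch list, which leaves room for the remaining hosts as long as the total size is large enough.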

Note that the fetch list size shouldn't be too small because there's a certain overhead running a fetch cycle (DNS lookup, robots.txt fetching and parsing, and the resources spent for the generate and update steps).

  2. How can I configure Nutch to focus on the websites that have been crawled less?

There is no out-of-the-box solution. It could be implemented with a scoring filter.

  3. Also, out of 10k seeds, Nutch has given me stats for only around 3k. How can I get stats for all seed URLs (even if they were not found)?

The fetch list size of the first cycle which fetches the seeds should be at least the size of the seed list.