Apache Nutch is crawling a few domains more and others less with the default configuration

Question

79 Views Asked by Hafiz Muhammad Shafiq At 18 November 2025 at 08:30

I have set up Apache Nutch 1.18 on a Hadoop cluster and given it a seed list of around 10k URLs. After some time, I ran the domainstats command to get statistics for each domain. I found that Nutch crawls some websites much more heavily and fetches only a few pages from many others. Have a look at the image below.

I am using mostly the default configuration. Only generate.max is set to 500. Where is the problem?

How can I configure Nutch to treat all domains equally when selecting URLs?
How can I configure Nutch to focus on the websites that have been crawled less?
Also, out of 10k seeds, Nutch has given me stats for only around 3k. How can I get stats for all seed URLs (even if they are not found)?

**Sebastian Nagel** · Answer 1

During fetch list generation, Nutch groups URLs by host name (the default for generate.count.mode); it can also group by registered domain or IP. Both the total size of the fetch list and the size of the fetch list per host/domain/IP are configurable.

If it's a requirement to include URLs from all hosts in a generate-fetch-update cycle, the total size of the fetch list (--size-fetchlist for bin/crawl or -topN for bin/nutch generate) should be a multiple of the number of unique host names. E.g., with 10k hosts/sites a reasonable fetch list size could be 200k. To ensure that all hosts/sites are included, set the maximum size of each per-host fetch list (property generate.max.count) to the value of the multiplier, here 20 (200k / 10k hosts).

Note that the fetch list size shouldn't be too small, because every fetch cycle carries a certain overhead (DNS lookups, robots.txt fetching and parsing, and the resources spent on the generate and update steps).
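To illustrate the sizing rule above, here is a minimal Python sketch of how a per-host cap changes which URLs end up in a fetch list. It is a simplified model, not Nutch's Generator code: `build_fetch_list`, the example hosts, and the scores are all made up for illustration, with `top_n` standing in for the fetch list size and `max_per_host` for generate.max.count.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_fetch_list(scored_urls, top_n, max_per_host):
    """Pick up to top_n URLs by descending score, at most max_per_host per host.

    Simplified model of per-host capping; not Nutch's actual Generator logic.
    """
    per_host = defaultdict(int)
    selected = []
    for url, _score in sorted(scored_urls, key=lambda item: item[1], reverse=True):
        if len(selected) >= top_n:
            break
        host = urlsplit(url).hostname
        if per_host[host] >= max_per_host:
            continue  # this host has already used up its share of the list
        per_host[host] += 1
        selected.append(url)
    return selected


# Sizing from the answer: 10k hosts times a multiplier of 20.
num_hosts = 10_000
multiplier = 20
fetch_list_size = num_hosts * multiplier

# Toy data (hypothetical): one big site with many high-scoring pages, two small sites.
candidates = [(f"https://big.example/p{i}", 1.0) for i in range(50)]
candidates += [("https://small-a.example/", 0.5), ("https://small-b.example/", 0.4)]

# Without an effective cap, the big site fills the whole list.
uncapped = build_fetch_list(candidates, top_n=6, max_per_host=10**9)
# With a cap of 2 per host, the small sites get their turn.
capped = build_fetch_list(candidates, top_n=6, max_per_host=2)

print(fetch_list_size)
print(uncapped)
print(capped)
```

In this toy run the uncapped list contains only big.example pages, while the capped list keeps two of them and still has room for both small sites. The same reasoning is why the per-host cap should equal the multiplier when the fetch list is sized as a multiple of the host count.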

> How can I configure Nutch to focus on the websites that have been crawled less?

There is no out-of-the-box solution. It could be implemented with a custom scoring filter.
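As a rough idea of what such a filter might compute, the Python sketch below lowers a URL's sort value as its host accumulates fetched pages, so less-crawled hosts move up the fetch list. This is plain Python rather than the Nutch scoring filter API, and the formula, the `damping` parameter, and the per-host counts are assumptions made purely for illustration.

```python
def damped_sort_value(base_score, fetched_on_host, damping=1.0):
    """Lower the sort value as a host accumulates fetched pages.

    Hypothetical formula; `damping` controls how strongly
    well-crawled hosts are pushed down the fetch list.
    """
    return base_score / (1.0 + damping * fetched_on_host)


# Hypothetical per-host fetch counts, e.g. taken from domain statistics.
fetched_counts = {"big.example": 400, "small.example": 3}

# (host, url, base score) tuples; all values are made up.
candidates = [
    ("big.example", "https://big.example/page", 1.0),
    ("small.example", "https://small.example/page", 0.2),
]

# Rank candidates by the damped value instead of the raw score.
ranked = sorted(
    candidates,
    key=lambda c: damped_sort_value(c[2], fetched_counts.get(c[0], 0)),
    reverse=True,
)

for host, url, score in ranked:
    print(host, url, damped_sort_value(score, fetched_counts.get(host, 0)))
```

With these numbers the page on small.example ranks first even though its raw score is lower, because big.example already has 400 fetched pages. A real implementation would need the per-host counts available at generate time, which is the hard part this sketch leaves out.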

> Also, out of 10k seeds, Nutch has given me stats for only around 3k. How can I get stats for all seed URLs (even if they are not found)?

The fetch list size of the first cycle, which fetches the seeds, should be at least the size of the seed list.
