I have configured Apache Nutch 2.3.1 with the complete Hadoop/HBase ecosystem. I want my crawler to give preference, in each iteration, to the domains given in the seed. In my testing it can go entirely in either direction, i.e. select all URLs from outlinks or vice versa. Let's say I want 40% of the selected URLs to come from outlinks (domains other than those given in the seed) and 60% to belong to domains given in the seed. Is this possible, and how?
I think it is the generator step that is causing this behaviour.
First, Nutch doesn't offer any built-in mechanism for a 60%/40% ratio. That being said, I think a lot of what is in this answer (https://stackoverflow.com/a/49240868/1977773) applies here.
The generator sorts the URLs by score and then takes the top n URLs for the next cycle. One way would be to assign a high initial score in the seed file (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L80), so that when the URLs are added to your crawldb this higher score propagates to their outlinks. When the next cycle comes, you'll have some "preference" for the outlinks coming from your seed file.
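As a rough illustration of the seed-file approach: the injector reads optional `name=value` metadata after each URL, separated by tabs, and `nutch.score` is the reserved key for a custom initial score. The URLs and the value `10` below are placeholders, not recommendations.

```
http://example.com/	nutch.score=10
http://example.org/	nutch.score=10
```

URLs without a `nutch.score` entry get the default injected score from your configuration.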
This could add a bit of noise, since an outlink on one of your seed URLs may point to a different domain, and the score will be propagated to it as well. This could be fixed with a custom attribute in the seed file and a custom scoring filter, so that you give a higher score only to links within the same domain.
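A minimal sketch of the rule such a scoring filter could apply is below. It is plain Java and deliberately does not use the Nutch `ScoringFilter` interface; the class `SameHostBoost`, its methods, and the `externalDamping` parameter are all hypothetical names for illustration. It also compares hosts rather than registered domains, which is a simplification.

```java
import java.net.URI;

// Hypothetical helper, not the Nutch ScoringFilter API: it only illustrates
// the rule a custom scoring filter could apply when passing score to outlinks.
class SameHostBoost {

    // Returns the host part of a URL in lower case, or "" if there is none.
    // URI.create throws IllegalArgumentException for malformed URLs.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase();
    }

    // Same-host outlinks inherit the parent's score unchanged; outlinks to
    // other hosts get the score multiplied by externalDamping (assumed value).
    static double outlinkScore(String parentUrl, String outlinkUrl,
                               double parentScore, double externalDamping) {
        String parentHost = hostOf(parentUrl);
        boolean sameHost = !parentHost.isEmpty()
                && parentHost.equals(hostOf(outlinkUrl));
        return sameHost ? parentScore : parentScore * externalDamping;
    }

    public static void main(String[] args) {
        // Internal link keeps the score, external link is damped.
        System.out.println(outlinkScore("http://example.com/a", "http://example.com/b", 10.0, 0.1));
        System.out.println(outlinkScore("http://example.com/a", "http://other.org/x", 10.0, 0.1));
    }
}
```

In a real plugin the same comparison would live inside your scoring filter's outlink-scoring step, with the seed marker read from the custom attribute mentioned above.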
But if you really want to achieve the 60/40 ratio (or something deterministic), I think the way to go is a custom generator, where you have total control over which URLs to crawl.
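To make the idea concrete, here is a sketch of the selection logic only, assuming the candidates and their scores are already in memory. It is not Nutch code: `QuotaSelector`, `Candidate`, and `select` are hypothetical names, and a real generator would apply the same quotas inside its MapReduce job rather than on a list.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Nutch: quota-based selection that a
// custom generator could apply to its scored candidate URLs.
class QuotaSelector {

    // A candidate URL with the score the generator would normally sort by.
    static final class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    // True when the URL's host is one of the (lower-case) seed hosts.
    static boolean isSeedHost(String url, Set<String> seedHosts) {
        String host = URI.create(url).getHost();
        return host != null && seedHosts.contains(host.toLowerCase());
    }

    // Picks up to topN candidates: about seedShare of the slots go to seed
    // hosts, the rest to other hosts, each group taken in descending score.
    static List<Candidate> select(List<Candidate> candidates, Set<String> seedHosts,
                                  int topN, double seedShare) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());

        int seedQuota = (int) Math.round(topN * seedShare);
        int otherQuota = topN - seedQuota;

        List<Candidate> selected = new ArrayList<>();
        List<Candidate> leftovers = new ArrayList<>();
        for (Candidate c : sorted) {
            boolean seed = isSeedHost(c.url, seedHosts);
            if (seed && seedQuota > 0) {
                selected.add(c);
                seedQuota--;
            } else if (!seed && otherQuota > 0) {
                selected.add(c);
                otherQuota--;
            } else {
                leftovers.add(c);
            }
        }
        // If one group ran short, fill the remaining slots with the
        // best-scoring leftovers so the batch still reaches topN if possible.
        for (Candidate c : leftovers) {
            if (selected.size() >= topN) break;
            selected.add(c);
        }
        return selected;
    }
}
```

Calling `select(candidates, seedHosts, 1000, 0.6)` would aim for 600 seed-domain URLs and 400 others per cycle. The fallback fill is a design choice; drop it if you would rather generate a smaller batch than break the ratio.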