Apache Nutch not crawling all websites in in-links


I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. The configuration is as follows:

<configuration>

<property>
  <name>db.score.link.internal</name>
  <value>5.0</value>
</property>

<property>
  <name>enable.domain.check</name>
  <value>true</value>
</property>

<property>
  <name>http.timeout</name>
  <value>30000</value>
</property>

<property>
  <name>generate.max.count</name>
  <value>200</value>
</property>

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>

<property>
  <name>http.agent.name</name>
  <value>My Private Spider Bot</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>My Private Spider Bot</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>

</configuration>

There are 3 compute nodes where the Nutch job runs. The problem is that after using 5000 domains as the starting seed, Nutch is only fetching a few domains, and there are a lot of new domains where only a single document is fetched. I want Nutch to fetch all domains fairly. I have also set db.score.link.internal to 5, but my tweaking shows that this property has no impact at all.
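For reference, generate.max.count is applied per the unit set in generate.count.mode (per host by default), so I am also experimenting with counting per domain instead. A minimal sketch, assuming the standard generate.count.mode and fetcher.queue.mode properties from nutch-default.xml apply to this 2.3.1 build:

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
  <description>Apply generate.max.count per domain rather than per host,
  so each generate round is spread across more domains.</description>
</property>

<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
  <description>Queue fetch items per domain so that a single large host
  cannot monopolize a fetch round.</description>
</property>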

I have post-processed the crawled data and found that there are 14000 domains in total in the database (HBase); of these, more than 50% have not been crawled by Nutch at all (their documents have fetch status code 0x01, i.e., unfetched). Why is that? How can I change Nutch to consider new domains as well, i.e., to be fair to all domains when fetching?
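To verify the status distribution without scanning HBase by hand, the web table stats can be printed directly; a minimal sketch, assuming Nutch 2.x's readdb (WebTableReader) supports the -stats flag, with a placeholder crawl ID:

bin/nutch readdb -crawlId mycrawl -stats

This prints a count per status code; status 1 (0x01) is status_unfetched, i.e., the URL is known from in-links but has never been fetched.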

1 Answer


How are you doing the crawling? bin/crawl has an argument that controls the crawl depth (the number of generate/fetch rounds, i.e., how far links are followed). You can achieve good results by running it with enough rounds for your desired websites' approximate total size: plan at least one round per 3000 pages. That means if you have 18000 pages (including pages reached through links), you would run it 18000/3000 = 6 times to get the full data.
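As a concrete sketch, assuming the Nutch 2.x bin/crawl signature <seedDir> <crawlID> [<solrURL>] <numberOfRounds> and placeholder paths, six rounds would be:

# urls/, mycrawl and the Solr URL are placeholders for your setup
bin/crawl urls/ mycrawl http://localhost:8983/solr 6

Each round runs a full generate/fetch/parse/updatedb cycle, so domains discovered through in-links in one round become eligible for generation and fetching in the next.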