Heritrix single-site scrape, including required off-site assets

Question

776 Views Asked by Karl M.W. At 26 May 2015 at 15:49

I believe I need help composing Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules

I need to scrape a complete copy of a website (listed in the crawler-beans.cxml seed list) without scraping any external (off-site) pages. Any external resources needed to render the site's pages should still be downloaded, but links to off-site pages should not be followed: only the assets for the current page/domain are wanted.

For example, CDN content required to render a page might be hosted on an external domain (maybe AWS or Cloudflare). I need to download that content and follow all on-domain links, but not follow any links to pages outside the current domain.

There are 2 best solutions below

Erik Melkersson On 14 February 2023 at 08:05

I asked a related question, "Crawling rules in heritrix, how to load embedded content?", and came up with a solution there. I found this post later, so I am posting my solution here as well:

Note: This question is old, so it was most likely asked about an older Heritrix version. I am using 3.4.

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- Begin by ACCEPTing all... -->
      <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
      <!-- ...then REJECT everything that does not match the site regex... -->
      <bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="regexList">
          <list>
            <value>.*site\.domain/path/.*</value>
          </list>
        </property>
      </bean>

      <!-- ...but ACCEPT URIs whose hops path matches E (embed) or X (speculative embed) -->
      <bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT"/>
        <property name="regex" value="(E|X)" />
      </bean>

      <!-- Below are some of the "standard" rules set up on a fresh job. Embedded content is handled the same way with or without them. -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <!-- <property name="maxHops" value="20" /> -->
      </bean>
      <!-- ...and REJECT those with suspicious repeating path-segments... -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
        <!-- <property name="maxRepetitions" value="2" /> -->
      </bean>
      <!-- ...and REJECT those with more than threshold number of path-segments... -->
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
        <!-- <property name="maxPathDepth" value="20" /> -->
      </bean>
      <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
      <!-- ...but always REJECT those with unsupported URI schemes -->
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
      </bean>
    </list>
  </property>
</bean>

Adjust <value>.*site\.domain/path/.*</value> to match your site, and its path if any.
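
As an illustration only, for a hypothetical site at www.example.com crawled from its root (the hostname is an assumption, not taken from the question), the adjusted rule might look roughly like this:

```xml
<!-- Sketch: REJECT every URI that does not match the site regex.
     www.example.com is a placeholder host; replace it with your own. -->
<bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <property name="regexList">
    <list>
      <!-- Dots in the hostname are escaped so they match literally -->
      <value>.*www\.example\.com/.*</value>
    </list>
  </property>
</bean>
```

Keeping the leading and trailing .* mirrors the pattern in the configuration above, so the scheme in front of the host and any path after it are covered by the same expression.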

You can also adjust <property name="regex" value="(E|X)" />: (E|X) can be reduced to just E if you only want resources known to be embedded in the page, such as images and CSS. X is somewhat experimental; it also tries URLs found in JavaScript files.
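
A minimal sketch of the narrower variant described above, using the same bean with only the regex value changed:

```xml
<!-- Same rule as in the configuration above, with the regex narrowed
     from (E|X) to E so that speculative X hops are no longer accepted -->
<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
  <property name="decision" value="ACCEPT"/>
  <property name="regex" value="E" />
</bean>
```

This fragment replaces the HopsPathMatchesRegexDecideRule bean in the rule list; the other rules stay as they are.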

Nytux (accepted answer) On 27 May 2015 at 13:25

You could use 3 decide rules:

The first one accepts all non-HTML content, using a ContentTypeNotMatchesRegexDecideRule.
The second one accepts all URLs in the current domain.
The third one rejects all pages that are not in the domain and were not reached directly from the domain (the alsoCheckVia option).

So something like this:

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
 <property name="rules">
  <list>
   <!-- Begin by REJECTing all... -->
   <bean class="org.archive.modules.deciderules.RejectDecideRule" />

   <!-- ...then ACCEPT anything whose content type is not HTML or WML... -->
   <bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="regex" value="(?i)html|wml"/>
   </bean>
   <!-- ...and ACCEPT every URL in the current domain... -->
   <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <property name="decision" value="ACCEPT"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property>
     </bean>
    </property>
   </bean>
   <!-- ...but REJECT anything outside the domain that was not reached directly from it (alsoCheckVia) -->
   <bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="alsoCheckVia" value="true"/>
    <property name="surtsSource">
     <bean class="org.archive.spring.ConfigString">
      <property name="value">
       <value>
        http://(org,yoursite,
       </value>
      </property>
     </bean>
    </property>
   </bean>
  </list>
 </property>
</bean>
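
The placeholder http://(org,yoursite, is a SURT prefix: the host yoursite.org with its labels reversed and separated by commas. As a sketch for a hypothetical site at example.com (an assumed name, not from the question), the surtsSource property in both SURT rules would carry something like:

```xml
<!-- Sketch: SURT prefix for the placeholder domain example.com.
     The open trailing comma leaves the prefix unterminated, so
     subdomains such as www.example.com fall under it as well. -->
<property name="surtsSource">
 <bean class="org.archive.spring.ConfigString">
  <property name="value">
   <value>
    http://(com,example,
   </value>
  </property>
 </bean>
</property>
```

Use the same prefix in both the SurtPrefixedDecideRule and the NotSurtPrefixedDecideRule so that the accept and reject rules agree on what counts as the current domain.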
