I believe need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules
I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded, however not following any links to off-site pages - only the assets for the current page/domain.
For example, CDN content required for the rendering of a page might be hosted on an external domain (maybe AWS or Cloudflare), so I would need to download that content, as well as following all on-domain links, however not follow any links to pages outside of the scope of the current domain.
You could use 3 decide rules:
ContentTypeNotMatchesRegexDecideRule
;So something like that: