How do I exclude everything but links/outlinks from a Heritrix crawl?

I'm working with Heritrix and I'm a bit stuck with managing its output.

I'm studying PageRank and I need Heritrix to generate a file to which I can apply the ranking algorithm. The file should contain only the links and outlinks for each visited page.

I would like to avoid postprocessing as much as I can. Is it possible to customize Heritrix's output by specifying what should be included and what should not? I have already tried modifying the cxml file, but the output still contains a lot of unhelpful information (such as the page content).

There is 1 best solution below

It's not possible to directly do what you're describing without writing code. If you're up for writing code, you can write a pretty simple processor, or a ScriptedProcessor, that dumps CrawlURI.getOutLinks() in whatever format you prefer.

But I would recommend postprocessing. I'm not sure why you want to avoid it. You could use the "warcfilter" tool from https://github.com/internetarchive/warctools. Run "warcfilter --type metadata" to filter out only the metadata records, which contain the lists of outlinks. You could cut it down further with grep.
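As a rough sketch of that postprocessing step, the Python below collects the outlinks per page from the text of metadata records. It assumes the records have already been decompressed to plain text, that each record names its page in a `WARC-Target-URI` header, and that each outlink appears as a body line of the form `outlink: <url> <hop> <context>`. The function name `parse_outlinks` is made up for illustration, and the exact record layout should be checked against your own crawl output.

```python
from collections import defaultdict


def parse_outlinks(lines):
    """Return {page_url: [outlink_url, ...]} from WARC metadata record text.

    Assumes plain-text records whose bodies contain lines such as
    'outlink: http://example.com/ L a/@href' (an assumption about the
    crawler's metadata layout, not a guaranteed format).
    """
    edges = defaultdict(list)
    current = None  # URL of the page the current record describes
    for raw in lines:
        line = raw.strip()
        if line.startswith("WARC/"):
            # A new record starts; forget the previous page URL.
            current = None
        elif line.lower().startswith("warc-target-uri:"):
            current = line.split(":", 1)[1].strip().strip("<>")
        elif line.startswith("outlink:") and current is not None:
            fields = line.split()
            if len(fields) >= 2:
                edges[current].append(fields[1])
    return dict(edges)
```

Calling `parse_outlinks(open("metadata.txt", encoding="utf-8", errors="replace"))` on a hypothetical file holding the filtered records would give an adjacency list that a PageRank implementation can consume directly.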

Inlinks are a much bigger question. You would have to search through the outlinks from all your WARCs to get the inlinks to any given URL.
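Once the outlinks from every WARC are gathered into one mapping, inverting it gives the inlinks. Here is a minimal sketch, assuming the whole outlink map fits in memory; the name `invert_links` is just illustrative.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [targets]} into {target: sorted list of sources}."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            # A set removes duplicates when a page links to a URL twice.
            inlinks[target].add(source)
    return {url: sorted(sources) for url, sources in inlinks.items()}
```

For a large crawl the map may not fit in memory, in which case sorting the (target, source) pairs on disk would do the same job.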