I'm working with Heritrix and I'm a bit stuck with managing its output.
I'm studying PageRank and I need Heritrix to generate a file that I can run the ranking algorithm against. The file should contain only the links and outlinks for each visited page.
I would like to avoid postprocessing as much as I can. Is it possible to customize Heritrix's output by specifying what should be included and what should not? I have already tried modifying the cxml file, but there is still a lot of unhelpful information in the output (like the page content).
It's not possible to directly do what you're describing without writing code. If you're up for writing code, you can write a pretty simple processor, or a ScriptedProcessor, that dumps CrawlURI.getOutLinks() in whatever format you prefer.
But I would recommend postprocessing. I'm not sure why you want to avoid it. You could use the "warcfilter" tool from https://github.com/internetarchive/warctools. Run "warcfilter --type metadata" to keep only the metadata records, which contain the lists of outlinks. You could cut the result down further with grep.
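If grep is not enough, the filtered metadata records can be turned into a page-to-outlinks map with a few lines of Python. This is only a rough sketch: it assumes each record carries a "WARC-Target-URI:" header and that outlinks appear as lines of the form "outlink: <url> <hop type> <context>", so check that against your own WARCs first. The function name parse_outlinks and the sample URLs are made up for illustration.

```python
from collections import defaultdict


def parse_outlinks(lines):
    """Map each page URL to the outlink URLs found in its metadata record.

    Assumes (not verified against every Heritrix version) that records
    start with a "WARC/" version line, name the page in a
    "WARC-Target-URI:" header, and list links as "outlink: <url> ...".
    """
    outlinks = defaultdict(list)
    page = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("WARC/"):
            # A new record begins: forget the previous page.
            page = None
        elif line.startswith("WARC-Target-URI:"):
            # Some writers wrap the URI in angle brackets; strip them.
            page = line.split(":", 1)[1].strip().strip("<>")
        elif line.startswith("outlink:") and page is not None:
            fields = line.split()
            if len(fields) >= 2:
                outlinks[page].append(fields[1])
    return dict(outlinks)


# Hypothetical sample of what a filtered metadata record might look like.
sample = [
    "WARC/1.0",
    "WARC-Type: metadata",
    "WARC-Target-URI: http://a.example/",
    "",
    "via: http://seed.example/",
    "outlink: http://b.example/ L a/@href",
    "outlink: http://c.example/x.css E link/@href",
]

for source, targets in parse_outlinks(sample).items():
    for target in targets:
        print(source, target, sep="\t")
```

To use it on real data you would feed it the lines of the warcfilter output (for example by iterating over sys.stdin opened as text) instead of the sample list.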
Inlinks are a much bigger question. Heritrix only records the outlinks of each page it fetches, so you would have to search through the outlinks from all your WARCs to get the inlinks to any given URL.
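Once the outlinks of every page have been collected, the inlinks fall out of inverting that map. A minimal sketch, assuming the whole link graph fits in memory as a dict from page URL to a list of outlink URLs (the function name invert_links and the sample URLs are hypothetical):

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {page: [outlink, ...]} into {url: [pages linking to it, ...]}."""
    inlinks = defaultdict(set)
    for page, targets in outlinks.items():
        for target in targets:
            # A set drops duplicate links from the same page.
            inlinks[target].add(page)
    # Sorted lists make the result stable and easy to write out.
    return {url: sorted(sources) for url, sources in inlinks.items()}


# Hypothetical crawl result: page -> outlinks.
crawl = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
}

for url, sources in invert_links(crawl).items():
    print(url, sources)
```

For a crawl too large for memory, the same inversion can be done by writing "target<TAB>source" pairs to a file and sorting it on the first column.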