Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, and many classes have since been changed or significantly rewritten/refactored), could anyone point me to the class (or classes) that actually handles web page content extraction?
What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to it (analyzing structural features, etc.). I think this functionality may be distributed across the ContentExtractor class and its many subclasses, but what I'm really trying to locate is the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies its regular expressions to (in order to find links, certain file types, etc.)?
I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriterProcessor class is rather straightforward and well commented. The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
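As a rough illustration of that approach, here is a minimal sketch of a custom post-fetch Processor that reads the recorded response body and hands it to a classifier. The class and package names are hypothetical, and the Recorder/replay accessors are assumptions based on how the writer processors read recorded content; check the 3.1.0 javadoc for the exact method names before relying on them.

```java
package com.example.heritrix; // hypothetical package

import java.io.IOException;
import java.io.InputStream;

import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

/**
 * Sketch: runs after the fetch processors, replays the downloaded body,
 * and passes it to a classifier (or writes it out for later analysis).
 */
public class ClassifyingProcessor extends Processor {

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        // Only look at successfully fetched HTML responses.
        String mime = curi.getContentType();
        return curi.getFetchStatus() > 0
                && mime != null
                && mime.toLowerCase().startsWith("text/html");
    }

    @Override
    protected void innerProcess(CrawlURI curi) throws InterruptedException {
        // Assumed accessor chain: the Recorder captured the HTTP transaction
        // during the fetch, and the content replay stream skips the headers.
        try (InputStream body =
                curi.getRecorder().getRecordedInput().getContentReplayInputStream()) {
            classify(curi.toString(), body);
        } catch (IOException e) {
            curi.getNonFatalFailures().add(e);
        }
    }

    private void classify(String uri, InputStream html) throws IOException {
        // Placeholder: parse structural features, run the model, etc.
    }
}
```

You would wire a bean for this class into the processing chains in crawler-beans.cxml (after the fetchers), the same way the stock writer processors are wired in.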
If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
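A minimal sketch of that on-the-fly variant is below. The subclass name is hypothetical, and the assumption is that innerExtract(CrawlURI) is the ContentExtractor hook that ExtractorHTML implements, so overriding it and delegating to super is the natural seam; verify the exact override point against the javadoc linked above.

```java
package com.example.heritrix; // hypothetical package

import org.archive.modules.CrawlURI;
import org.archive.modules.extractor.ExtractorHTML;

/**
 * Sketch: let ExtractorHTML do its normal link extraction, then classify
 * the same URI in the same pass. Swap this in for the stock extractor
 * bean in crawler-beans.cxml.
 */
public class ClassifyingExtractorHTML extends ExtractorHTML {

    @Override
    protected boolean innerExtract(CrawlURI curi) {
        // Normal regex-based link extraction first; this is where
        // ExtractorHTML reads the replayed page content.
        boolean handled = super.innerExtract(curi);

        // Then run the classifier. The raw HTML is still reachable through
        // curi.getRecorder() if you need the full document, not just links.
        classify(curi);

        return handled;
    }

    private void classify(CrawlURI curi) {
        // Placeholder for on-the-fly structural-feature classification.
    }
}
```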