Is Heritrix3.2.0 able to crawl ajax-based web sites?

457 Views Asked by T.Sh At 05 April 2015 at 15:27

Is it possible to crawl ajax-based web sites using Heritrix-3.2.0?

Nytux On 07 April 2015 at 14:14 BEST ANSWER

If you intend to make a "copy" of an ajax website, clearly no.

If you want to grab some data by analysing the content of the website, you can customize the crawler with an Extractor that determines which URLs to follow. On most websites you can easily guess the URLs that are interesting for your case without having to interpret the JavaScript. The ajax callbacks would then be crawled and handed to the Processor chain. By default this stores the ajax callback responses in the archive files.
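As an illustration of "guessing" URLs without interpreting the JavaScript, here is a minimal sketch in plain Java with no Heritrix dependency. The class name `AjaxUrlGuesser`, the method `guess`, and the convention that the interesting callbacks are quoted paths ending in `.json` are all assumptions made for the example; a real site will need its own pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans page content for quoted strings that look like ajax endpoints.
final class AjaxUrlGuesser {

    // Assumption: endpoints are quoted, start with "/" or "http(s)://",
    // and end in ".json" with an optional query string.
    private static final Pattern AJAX_URL = Pattern.compile(
            "[\"']((?:https?:)?/[^\"'\\s]*\\.json(?:\\?[^\"'\\s]*)?)[\"']");

    // Returns every candidate URL found in the content, in order of appearance.
    static List<String> guess(CharSequence content) {
        List<String> urls = new ArrayList<>();
        Matcher m = AJAX_URL.matcher(content);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<script>$.get(\"/data/items.json?page=2\", render);"
                + " var feed = 'http://example.com/feed.json';"
                + " var about = \"/about.html\";</script>";
        for (String url : guess(page)) {
            System.out.println(url);
        }
    }
}
```

Inside a custom Extractor, each string returned by such a helper would be a candidate to hand to the crawler as an outlink; relative paths would still need to be resolved against the page URL.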

Making your own Extractor looks like this:

    import java.io.IOException;

    import org.archive.io.ReplayCharSequence;
    import org.archive.modules.CrawlURI;
    import org.archive.modules.extractor.ContentExtractor;
    import org.archive.modules.extractor.Hop;
    import org.archive.modules.extractor.LinkContext;

    public class MyExtractor extends ContentExtractor {
        @Override
        protected boolean shouldExtract(CrawlURI uri) {
            // Run this extractor on every fetched URI
            return true;
        }

        @Override
        protected boolean innerExtract(CrawlURI curi) {
            try {
                ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();
                // ... analyse the page content cs as a CharSequence ...

                // decide you want to crawl some page with url [uri]:
                // (placeholder value; replace it with a URL derived from cs)
                String uri = "/some/ajax/callback";
                addOutlink(curi, uri, LinkContext.NAVLINK_MISC, Hop.NAVLINK);
                return true;
            } catch (IOException e) {
                // The recorded content could not be read, so extraction did not finish
                return false;
            }
        }
    }

Compile it, put the jar file in the heritrix/lib directory, and insert a bean referring to MyExtractor in the fetchProcessors chain: basically, duplicate the extractorHtml line in the crawl job cxml file.
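A sketch of what that could look like in the job's cxml file, assuming the layout of the default job profile. The bean id `myExtractor` is a hypothetical name, and the `class` attribute assumes MyExtractor was compiled without a package; check both against your own crawl job configuration.

```xml
<!-- Hypothetical bean id; use the fully qualified class name if MyExtractor is in a package -->
<bean id="myExtractor" class="MyExtractor"/>

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
 <property name="processors">
  <list>
   <!-- ... existing processors ... -->
   <ref bean="extractorHtml"/>
   <!-- added line, mirroring the extractorHtml reference -->
   <ref bean="myExtractor"/>
   <!-- ... remaining processors ... -->
  </list>
 </property>
</bean>
```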
