We need to crawl a large number of web pages (roughly 1.5 billion) every two weeks. Speed, and therefore cost, is a huge factor for us, as our initial attempts have ended up costing over $20k.
Is there any data on which crawler performs the best in a distributed environment?
For a comparison between Nutch and StormCrawler, see my article on DZone.
Heritrix can also be used in distributed mode, but the documentation is not very clear on how to do this. Nutch and StormCrawler rely on well-established platforms for distributing the computation (Apache Hadoop and Apache Storm respectively), whereas Heritrix does not.
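To give an idea of what "relying on Storm for the distribution" looks like in practice, here is a minimal sketch of a StormCrawler topology, loosely based on the one generated by the StormCrawler Maven archetype. Treat the exact class names and packages (com.digitalpebble.stormcrawler.*) as assumptions that may differ between versions; the point is simply that the crawl is expressed as a Storm topology and submitted to an existing Storm cluster, which then handles the distribution.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

// Minimal crawl topology: Storm, not the crawler itself, takes care of
// distributing the spout and bolts across the nodes of the cluster.
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Toy spout holding seed URLs in memory; a crawl of this size would
        // use a status-backed spout (e.g. Elasticsearch or SOLR) instead.
        builder.setSpout("spout", new MemorySpout("https://example.com/"));

        // Group URLs by host so politeness can be enforced per host.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("sitemap", new SiteMapParserBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("sitemap");

        builder.setBolt("index", new StdOutIndexer())
                .localOrShuffleGrouping("parse");

        // Route status updates (discovered URLs, fetch errors, etc.)
        // to a persistence layer; here they are simply printed.
        Fields url = new Fields("url");
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, url)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, url)
                .fieldsGrouping("parse", Constants.StatusStreamName, url)
                .fieldsGrouping("index", Constants.StatusStreamName, url);

        return submit("crawl", conf, builder);
    }
}
```

The same jar runs unchanged in local mode or on a multi-node Storm cluster; scaling up is mostly a matter of increasing the number of workers and the parallelism of the bolts rather than changing the crawl logic.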
Heritrix is also used mostly by the web-archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for extracting data.
I am not familiar with the two hosted services you mention, as I use only open source software.