We need to crawl a large number of web pages (roughly 1.5 billion) every two weeks. Speed, and therefore cost, is a huge factor for us, as our initial attempts have ended up costing over $20k.
Is there any data on which crawler performs the best in a distributed environment?
For a comparison between Nutch and StormCrawler, see my article on DZone.
Heritrix can also be used in distributed mode, but the documentation is not very clear on how to do this. Nutch and StormCrawler rely on well-established platforms for distributing the computation (Apache Hadoop and Apache Storm respectively), whereas Heritrix does not.
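To give an idea of what "relying on Storm for the distribution" looks like in practice, here is a minimal sketch of a StormCrawler topology, loosely based on the one generated by the StormCrawler Maven archetype. Treat the exact class names and packages (com.digitalpebble.stormcrawler.*) as assumptions that may differ between versions; the point is simply that the crawl is expressed as a Storm topology and submitted to an existing Storm cluster, which then handles the distribution.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

// Minimal crawl topology: Storm, not the crawler itself, takes care of
// distributing the spout and bolts across the nodes of the cluster.
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Toy spout holding seed URLs in memory; a crawl of this size would
        // use a status-backed spout (e.g. Elasticsearch or SOLR) instead.
        builder.setSpout("spout", new MemorySpout("https://example.com/"));

        // Group URLs by host so politeness can be enforced per host.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("sitemap", new SiteMapParserBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("sitemap");

        builder.setBolt("index", new StdOutIndexer())
                .localOrShuffleGrouping("parse");

        // Route status updates (discovered URLs, fetch errors, etc.)
        // to a persistence layer; here they are simply printed.
        Fields url = new Fields("url");
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, url)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, url)
                .fieldsGrouping("parse", Constants.StatusStreamName, url)
                .fieldsGrouping("index", Constants.StatusStreamName, url);

        return submit("crawl", conf, builder);
    }
}
```

The same jar runs unchanged in local mode or on a multi-node Storm cluster; scaling up is mostly a matter of increasing the number of workers and the parallelism of the bolts rather than changing the crawl logic.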
Heritrix is also used mostly by the web-archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for extracting data.
I am not familiar with the two hosted services you mention, as I use only open source software.