Crawling using Storm Crawler

1.7k Views Asked by Ravi Ranjan

We are trying to use StormCrawler to crawl data. We have been able to extract the sub-links from a URL, but we also want to get the contents of those sub-links. I have not been able to find many resources explaining how to do this. Any useful links/websites in this regard would be helpful. Thanks.
There is 1 best solution below
The Getting Started page, the presentations and talks, as well as the various blog posts should be useful.

If the sub-links are fetched and parsed, which you can check in the logs, then their content is available for indexing or for storage, e.g. as WARC. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
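As a concrete starting point, here is a minimal topology sketch in the spirit of the storm-crawler Maven archetype: it fetches the seeds, parses them with JSoupParserBolt (which discovers the sub-links), prints the extracted content via the dummy StdOutIndexer, and feeds the discovered URLs back into the in-memory queue so the sub-links get fetched and parsed in turn. It assumes the com.digitalpebble.stormcrawler package names of the 1.x releases (newer Apache releases use org.apache.stormcrawler), and the seed URL is a placeholder.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URL(s) - placeholder, replace with your own
        builder.setSpout("spout", new MemorySpout("http://example.com/"));

        // Group URLs by host so politeness settings apply per host
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        // Fetch the page content
        builder.setBolt("fetch", new FetcherBolt())
               .fieldsGrouping("partitioner", new Fields("key"));

        // Parse the HTML: extracts the text and discovers the outlinks
        builder.setBolt("parse", new JSoupParserBolt())
               .localOrShuffleGrouping("fetch");

        // Dummy indexer: dumps the URL and extracted text to the console
        builder.setBolt("index", new StdOutIndexer())
               .localOrShuffleGrouping("parse");

        Fields furl = new Fields("url");

        // Discovered outlinks are emitted on the status stream; the
        // in-memory updater queues them back onto the MemorySpout so
        // the sub-links get fetched and parsed in turn
        builder.setBolt("status", new MemoryStatusUpdater())
               .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
               .fieldsGrouping("parse", Constants.StatusStreamName, furl)
               .fieldsGrouping("index", Constants.StatusStreamName, furl);

        return submit("crawl", conf, builder);
    }
}
```

Note that the MemorySpout/MemoryStatusUpdater pair is only meant for local experimentation; for anything beyond that, persist the URL status in a proper backend (e.g. the Elasticsearch module) and swap StdOutIndexer for the corresponding indexing bolt.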