Getting started with StormCrawler for document crawling

I am having difficulty getting started with StormCrawler using the StormCrawler+Elasticsearch archetype. The StormCrawler website lists two versions, 1.x and 2.x. Similarly, Apache Storm comes in versions 1 and 2.

  1. Should I install StormCrawler 1.x or 2.x?
  2. Which version of the JDK does StormCrawler require? Do I need the Oracle JDK, or can OpenJDK be used as well?
  3. I want to use StormCrawler to identify and process images and documents. Where in the topology are these tasks best added?

Update: According to another thread (Storm Crawler with Java 11), StormCrawler 2 is advised. Which StormCrawler+Elasticsearch archetype should be used with StormCrawler 2?

There is 1 answer below

SC 1.x is the stable, current version; 2.x is less tested but will become the main version at some point.

The thread mentioned in the question does not advise you to use SC2 as such; it says that you should use it if you need Java 11. If you are on Java 8, you can use whichever version you want. SC works fine with OpenJDK.
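If it helps, the rule of thumb above can be written down as a small check against the running JVM. This is only a sketch: the class and method names are made up for illustration and are not part of StormCrawler. Treating Java 9 and 10 the same as Java 8 is my assumption, since the thread only talks about Java 8 and Java 11.

```java
public class StormCrawlerVersionHint {

    // Parses a java.specification.version string such as "1.8", "11" or "17"
    // into its feature number (8, 11, 17).
    static int featureVersion(String spec) {
        // Java 8 and earlier report "1.x"; Java 9+ report the number directly.
        String s = spec.startsWith("1.") ? spec.substring(2) : spec;
        int dot = s.indexOf('.');
        if (dot >= 0) {
            s = s.substring(0, dot);
        }
        return Integer.parseInt(s);
    }

    // Hypothetical helper: Java 11 or later -> SC 2.x, otherwise either line.
    // Assumption: anything below 11 is treated like Java 8.
    static String recommend(String spec) {
        return featureVersion(spec) >= 11 ? "2.x" : "1.x or 2.x";
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java " + spec + " -> StormCrawler " + recommend(spec));
    }
}
```

Running `java -version` tells you the same thing; the point is just that the choice depends on the Java version, not on whether the JDK comes from Oracle or OpenJDK.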

As for question #3, it depends what you want to do. Can you please elaborate?