I've built a multi-threaded web crawling and extraction engine with plain Java and Selenium. Each job comes from an API, is executed in its own thread, and commits its state back to the API. A job can also carry extraction information (XPath, regex, CSS selectors), connection information (proxy credentials), and hooks for the crawling engine, for example to click a button before saving the result. The engine works well, but now I want to run it in parallel on multiple machines. I could do this with the current version (it has channel support), but I'm looking for improvements and technologies to make the whole thing even better and to learn something new.
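To make the setup concrete, here is a minimal sketch of the job model and threading described above. All class and field names (`CrawlJob`, `ExtractionRules`, `runJob`, etc.) are hypothetical illustrations, not the actual engine's API; the real engine would drive Selenium where the comment indicates.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class CrawlEngineSketch {

    // Extraction rules a job may carry (XPath, regex, CSS selector).
    record ExtractionRules(String xpath, String regex, String cssSelector) {}

    // A job as described in the question: extraction rules, proxy
    // credentials, and a hook run before saving the result
    // (e.g. "click a button before saving").
    record CrawlJob(String url, ExtractionRules rules,
                    String proxyCredentials, Consumer<String> preSaveHook) {}

    // Each job runs in its own worker thread and reports its final
    // state, which the real engine would commit back to the API.
    static String runJob(CrawlJob job) {
        // ... here Selenium would load job.url() through the proxy
        //     and apply job.rules() to the page ...
        job.preSaveHook().accept(job.url()); // engine hook fires first
        return "DONE:" + job.url();          // state sent back to the API
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<CrawlJob> jobs = List.of(
            new CrawlJob("http://example.com/a",
                new ExtractionRules("//h1", "\\d+", "div.content"),
                "user:pass@proxy:8080",
                url -> System.out.println("hook ran for " + url)));
        // One Callable per job, executed concurrently by the pool.
        List<Future<String>> results = pool.invokeAll(
            jobs.stream()
                .map(j -> (Callable<String>) () -> runJob(j))
                .toList());
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Distributing this across machines then amounts to moving the `ExecutorService` boundary out of the process: something else (a cluster manager or actor system) hands `CrawlJob`s to workers and collects their status.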
I found Akka.io, Apache Spark, Apache Mesos, and Apache Storm, and I'm asking myself whether one of these frameworks is a technology I should invest more time in and rebuild my engine on.
I don't yet understand all the differences and advantages of these frameworks, but that's why I'm asking; they seem quite similar.
Is my intention to build a crawling engine on one of these frameworks feasible? Would you suggest using one of them? Why or why not?
I previously helped build a rendering web crawler as an example/tutorial app for Apache Mesos. It's certainly not as complex as what you're building, but it might serve as a good architectural reference. You can check it out at https://github.com/mesosphere/rendler
Mesos provides a lot of distributed-systems plumbing: launching tasks, monitoring and sending task status, communication between tasks and the scheduler, persistent state, failover, and so on. We sometimes like to refer to Mesos as a "Distributed Systems SDK". http://mesosphere.github.io/presentations/mesoscon-2014/