Nutch: What version of Nutch + Cassandra actually works?


I'm trying to do some crawling with Nutch and I'd like to test out Cassandra as a backend. However, using the latest version of Nutch and its dependencies, Cassandra throws a variety of errors as you move through the inject, generate, fetch, etc. process.

The errors all come from actual problems in the code, not from running out of memory or from misconfiguration. I've fixed some of them by modifying code within gora-cassandra, but it's still not functional.

My question is: does a working combination of versions of these two projects exist? By working I mean you can run through inject, generate, fetch, parse, and updatedb on at least a small set of URLs without error.

Here's an example of one of the classes giving an error during fetch:

java.lang.NullPointerException at org.apache.gora.cassandra.query.CassandraSuperColumn.getUnionIndex

I have used HBase as the backend and that just works, although HBase itself is a monster to manage, which is why I'd like to test out Cassandra. However, I'm about to give up on this, as I don't think I should have to modify gora-cassandra code just to get a basic example to run.

Thanks


There is 1 best solution below


According to this link, which is about 3 months old, it's just broken: http://lucene.472066.n3.nabble.com/Re-user-Digest-3-Jun-2017-19-27-20-0000-Issue-2758-td4339060.html

It's unclear why backends that do not work are even documented.

HBase is most widely used, followed by MongoDB... on the other end of the spectrum, Cassandra is least used and broken. It has not been maintained for quite some time... and yes this is reflected by use of Super Columns. We are currently re-writing the backend as part of a GSoC project.

I would agree with the guy making the original statement: it's unclear why backends that do not work are even documented.

Really tired of this project and its lack of usable documentation.