My environment: 8 GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, Carrot2 Workbench 3.10.0
My Solr index: 15980 documents
My problem: cluster all documents with the k-means algorithm
When I run the query in the Carrot2 Workbench (query: *:*), I always get a Java heap space error when using more than ~1000 results. I started Solr with -Xms256m -Xmx6g, but the error still occurs.
Is it really a heap size problem or could it be somewhere else?
Your suspicion is correct: it is a heap size problem, or more precisely, a scalability constraint. Straight from the Carrot2 FAQ: http://project.carrot2.org/faq.html#scalability
A developer also posted about this here: https://stackoverflow.com/a/28991477
The developers recommend Mahout, and that is probably the way to go, since you would not be bound by Carrot2's in-memory clustering constraints. There are, however, a couple of other possibilities:
If you really like Carrot2 but do not necessarily need k-means, you could take a look at the commercial Lingo3G. Based on the "Time of clustering 100000 snippets [s]" field and the (***) remark on http://carrotsearch.com/lingo3g-comparison, it should be able to handle more documents. Also check their FAQ entry "What is the maximum number of documents Lingo3G can cluster?" on http://carrotsearch.com/lingo3g-faq
Try to minimize the size of the text k-means performs the clustering on. Instead of clustering over the full document content, cluster on the abstract/summary, or extract important keywords and cluster on those.
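The keyword-reduction step itself is cheap to do before indexing or before handing documents to the clusterer. Here is a minimal sketch in plain Java; the tiny stopword list and the frequency-based ranking are my own simplifications for illustration, not Carrot2 code:

```java
import java.util.*;
import java.util.stream.*;

public class KeywordExtractor {

    // A tiny illustrative stopword list; a real one would be much larger.
    private static final Set<String> STOPWORDS = Set.of(
            "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
            "for", "on", "with");

    /**
     * Condenses a document to its topN most frequent non-stopword terms,
     * joined into a short pseudo-summary. Clustering on this string instead
     * of the full content keeps the in-memory footprint small.
     */
    public static String topKeywords(String content, int topN) {
        Map<String, Long> counts = Arrays
                .stream(content.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> t.length() > 2 && !STOPWORDS.contains(t))
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                // Highest frequency first; alphabetical tie-break for
                // deterministic output.
                .sorted(Comparator
                        .comparingLong((Map.Entry<String, Long> e) -> -e.getValue())
                        .thenComparing(Map.Entry::getKey))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String doc = "Solr indexes documents and Solr clusters documents with "
                + "carrot2 clustering algorithms such as k-means clustering";
        // Prints the three most frequent terms: "clustering documents solr"
        System.out.println(topKeywords(doc, 3));
    }
}
```

With ~16000 documents, clustering a dozen keywords per document instead of the full body text shrinks the data the algorithm must hold in memory by orders of magnitude, which is exactly the constraint the Carrot2 FAQ describes.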