I am currently trying to run k-means clustering from mlpack, a scalable machine learning library, but whenever I execute bin/kmeans at the command line, I receive the following error:
error: arma::memory::acquire(): out of memory
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
The size of the input file is 7.4 GB.
Do you have any suggestions? Do you know of alternative tools that can handle a data set this large?
There isn't really an easy solution here if you need an exact answer. The issue is that Armadillo (the underlying matrix library) is not able to allocate enough space for your input data.
For the most part, mlpack is more conservative with RAM than other tools such as MATLAB or R, but it sounds like your dataset is large enough that your options (short of getting a system with more RAM, as Kerrek suggested) are limited.
Many strategies for accelerating k-means involve sampling the input dataset and running k-means on a subset of the input points. Because k-means is very sensitive to the initial centroids it is given, this sampling strategy is often used to choose initial centroids. See Bradley and Fayyad, 1998: ftp://www.ece.lsu.edu/pub/aravena/ee7000FDI/Presentations/Clustering-Pallavi/Ref4_k-means.pdf
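If you go the sampling route, a minimal sketch of pulling a manageable subset out of the large file is below. It uses plain reservoir sampling over the lines of the CSV so the full 7.4 GB file never has to be in RAM at once; the file name and sample size are placeholders you would adjust for your data.

    // Minimal sketch: reservoir-sample a fixed number of lines (points) from a
    // large CSV without loading the whole file into memory.
    #include <fstream>
    #include <random>
    #include <string>
    #include <vector>

    int main()
    {
      const std::string inputFile = "big_dataset.csv";  // placeholder path
      const size_t sampleSize = 100000;                 // enough points to fit in RAM

      std::vector<std::string> reservoir;
      reservoir.reserve(sampleSize);

      std::ifstream in(inputFile);
      std::mt19937_64 rng(std::random_device{}());
      std::string line;
      size_t seen = 0;  // number of lines processed so far

      while (std::getline(in, line))
      {
        if (reservoir.size() < sampleSize)
        {
          // Fill the reservoir with the first sampleSize lines.
          reservoir.push_back(line);
        }
        else
        {
          // Keep the current line with probability sampleSize / (seen + 1).
          std::uniform_int_distribution<size_t> dist(0, seen);
          const size_t j = dist(rng);
          if (j < sampleSize)
            reservoir[j] = line;
        }
        ++seen;
      }

      // Write the subsample to a smaller CSV that k-means can load comfortably.
      std::ofstream out("subsample.csv");
      for (const std::string& l : reservoir)
        out << l << '\n';

      return 0;
    }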
In your case, it may be easier, and sufficiently accurate, to simply run k-means on a subset of your data that fits in RAM and take the resulting centroids as your cluster centroids. If k is reasonably small and the number of points in your dataset is large (that is, the number of clusters is much, much smaller than the number of points), this should be a reasonable approach. It is certainly simpler than modifying the mlpack code to be even more conservative with RAM, or writing your own program that uses mmap() or something, and cheaper than buying more RAM.
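As a rough illustration, here is a sketch of clustering that subsample with mlpack's C++ KMeans class and saving the centroids. The file names and value of k are placeholders, and the exact namespaces and argument types (e.g., whether assignments is an arma::Row or arma::Col) have shifted between mlpack versions, so adjust this to match your installation.

    // Minimal sketch: run mlpack's KMeans on the subsampled data and keep the
    // resulting centroids (either as the final answer, or as initial centroids
    // for a later run over more of the data).
    #include <mlpack/core.hpp>
    #include <mlpack/methods/kmeans/kmeans.hpp>

    int main()
    {
      // Load only the subsample produced earlier; points are stored as columns.
      arma::mat data;
      mlpack::data::Load("subsample.csv", data, true /* fatal on failure */);

      const size_t clusters = 10;  // choose k for your problem
      arma::Row<size_t> assignments;
      arma::mat centroids;

      mlpack::kmeans::KMeans<> k;
      k.Cluster(data, clusters, assignments, centroids);

      // Each column of 'centroids' is one cluster center.
      centroids.save("centroids.csv", arma::csv_ascii);

      return 0;
    }

Since only the subsample is ever loaded, peak memory is governed by sampleSize rather than the full 7.4 GB file, which is the whole point of this approach.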