I'm attempting to classify some inputs (text classification: 10,000+ examples and 100,000+ features).
I've read that LibLinear is far faster and more memory-efficient for such tasks, so I've ported my LibSvm classifier to Accord.NET, like so:
//SVM Settings
var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
{
    //Using LIBLINEAR's L2-loss SVC dual for each SVM
    Learner = (p) => new LinearDualCoordinateDescent<Linear, Sparse<double>>()
    {
        Loss = Loss.L2,
        Complexity = 1,
    }
};

var inputs = allTerms
    .Select(t => new Sparse<double>(
        t.Sentence.Select(s => s.Index).ToArray(),
        t.Sentence.Select(s => (double)s.Value).ToArray()))
    .ToArray();

var classes = allTerms.Select(t => t.Class).ToArray();

//Train the model
var model = teacher.Learn(inputs, classes);
At the point of .Learn() I get an instant OutOfMemoryException.
I've seen there's a CacheSize setting in the documentation; however, I cannot find where to lower this setting, as is shown in many examples.
One possible reason: I'm using the 'hashing trick' instead of indices, so is Accord.NET attempting to allocate an array covering the full hash space (probably close to int.MaxValue)? If so, is there any way to avoid this?
Any help is most appreciated!
Allocating the hash space for 10,000+ documents with 100,000+ features will take at least 4 GB of memory: 10,000 × 100,000 entries is about 10^9 values, which at 4 bytes each already comes to roughly 4 GB. That allocation can be blocked by the AppDomain memory limit and the CLR object size limit. Many projects are built with the 32-bit platform preference by default, which does not allow allocating objects larger than 2 GB. I managed to overcome this by removing the 32-bit platform preference (go to project properties -> Build and uncheck "Prefer 32-bit"). After that, we should also allow the creation of objects taking more than 2 GB of memory by adding this line to your configuration file.
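A minimal sketch of that configuration entry, assuming the setting meant here is the standard gcAllowVeryLargeObjects runtime element (available since .NET Framework 4.5):

<configuration>
  <runtime>
    <!-- Allows arrays larger than 2 GB in total size on 64-bit platforms -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>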
Be aware that if you add this line but leave the 32-bit platform build preference enabled, you will still get the exception, as your project will not be able to allocate an array of such size.
This is how you can tune the CacheSize:
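A minimal sketch, assuming the per-class learner is the kernel-based SequentialMinimalOptimization (that learner, rather than the LibLinear-style ones, is the one exposing a CacheSize property):

var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
{
    Learner = (p) => new SequentialMinimalOptimization<Linear, Sparse<double>>()
    {
        //Cache used to partially store the kernel matrix; lower it to reduce memory usage
        CacheSize = 1000,
        Complexity = 1,
    }
};

var model = teacher.Learn(inputs, classes);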
This way of constructing an SVM does cope with the Sparse<double> data structure, but it is not using LibLinear. If you open the Accord.NET repository and look at the SVM solving algorithms with LibLinear support (LinearCoordinateDescent, LinearNewtonMethod), you will see no CacheSize property.