How to divide svm_problem into 5 folds for custom cross validation - LIBSVM


I am attempting to implement my own cross-validation function for LIBSVM, but I am confused about how to partition the data structures that LIBSVM builds from my input data.

The data is stored in a structure svm_problem:

public class svm_problem implements java.io.Serializable
{
    public int l;
    public double[] y;
    public svm_node[][] x;
}

Where: l is the number of training instances; y is the array of their target values; and x is an array of pointers, each of which points to the sparse representation of one training vector.

svm_node is defined as:

public class svm_node implements java.io.Serializable
{
    public int index;
    public double value;
}
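For concreteness, a small dense dataset could be packed into these structures as in the sketch below. This is only an illustrative snippet, not code from LIBSVM itself: the two class definitions are repeated so it compiles standalone (in real code they come from the libsvm package), and the 1-based feature indexing follows the usual libsvm convention.

```java
// Stand-ins for the libsvm classes shown above, repeated so this sketch compiles on its own.
class svm_node implements java.io.Serializable { public int index; public double value; }
class svm_problem implements java.io.Serializable {
    public int l;
    public double[] y;
    public svm_node[][] x;
}

public class BuildProblem {

    // Convert one dense feature vector into libsvm's sparse form:
    // only non-zero features are kept, with 1-based feature indices.
    static svm_node[] toNodes(double[] dense) {
        int nonZero = 0;
        for (double v : dense) if (v != 0.0) nonZero++;
        svm_node[] nodes = new svm_node[nonZero];
        int j = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                svm_node n = new svm_node();
                n.index = i + 1; // libsvm feature indices conventionally start at 1
                n.value = dense[i];
                nodes[j++] = n;
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        double[][] data = { { 0.5, 0.0, 1.2 }, { 0.0, 2.0, 0.0 } };
        double[] labels = { 1.0, -1.0 };

        svm_problem prob = new svm_problem();
        prob.l = data.length;           // number of training instances
        prob.y = labels;                // target values
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) prob.x[i] = toNodes(data[i]);

        System.out.println(prob.x[0].length); // first vector keeps its 2 non-zero features
    }
}
```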

My goal is to split the training data into 5 folds, use 4 of them for training (svm_train) and the remaining one for testing (svm_predict), in order to find which value of C gives the best prediction result (based on an error function).

My problem is how to separate the data into 5 folds given this structure. How can the data structures be properly divided into 5 folds so that I can proceed with the optimization of C?

I have been using this as a guide: A Practical Guide to Support Vector Classification

If someone could provide an example or a link to an example on how this is best done it would be greatly appreciated. Thanks.

1 Answer

The svm_problem describes, for i = 0, 1, ..., l - 1, that f(x[i]) should approximately equal y[i] for the learned function f. Each tuple (x[i], y[i]) can be thought of as a noisy sample from the function f that you are trying to find.

To split your dataset into training, cross-validation, and test sets, you can simply split the index set {0, 1, ..., l - 1} randomly into those 3 parts. This is typically done by shuffling the list of numbers 0, 1, ..., l - 1 and then saying "the first 60% of the numbers are training, the next 20% are cross validation, and the last 20% are testing," or something similar. For each of those subsets, you can then construct a new svm_problem that describes just that portion of the data.
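The same idea extends to the 5-fold setup in the question: shuffle the indices once, cut them into 5 roughly equal chunks, and for each fold build one svm_problem from the held-out chunk and one from the other four. Below is a minimal sketch of that bookkeeping; the svm_node/svm_problem classes are repeated only so the snippet compiles on its own (in real code they come from the libsvm package, and you would pass the training problem to svm.svm_train and each test row to svm.svm_predict).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Stand-ins for the libsvm classes quoted above, repeated so this sketch is self-contained.
class svm_node implements java.io.Serializable { public int index; public double value; }
class svm_problem implements java.io.Serializable {
    public int l;
    public double[] y;
    public svm_node[][] x;
}

public class FoldSplit {

    // Shuffle the indices 0..l-1 and cut them into k roughly equal folds.
    static int[][] makeFolds(int l, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < l; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        int[][] folds = new int[k][];
        for (int f = 0; f < k; f++) {
            int start = f * l / k, end = (f + 1) * l / k;
            folds[f] = new int[end - start];
            for (int j = start; j < end; j++) folds[f][j - start] = idx.get(j);
        }
        return folds;
    }

    // Build a new svm_problem holding only the rows named by `indices`.
    static svm_problem subProblem(svm_problem prob, int[] indices) {
        svm_problem sub = new svm_problem();
        sub.l = indices.length;
        sub.y = new double[sub.l];
        sub.x = new svm_node[sub.l][];
        for (int i = 0; i < sub.l; i++) {
            sub.y[i] = prob.y[indices[i]];
            sub.x[i] = prob.x[indices[i]]; // rows can be shared; no deep copy needed
        }
        return sub;
    }

    // Merge every fold except `holdOut` into one index array (the training portion).
    static int[] trainingIndices(int[][] folds, int holdOut) {
        int n = 0;
        for (int f = 0; f < folds.length; f++) if (f != holdOut) n += folds[f].length;
        int[] train = new int[n];
        int j = 0;
        for (int f = 0; f < folds.length; f++)
            if (f != holdOut)
                for (int i : folds[f]) train[j++] = i;
        return train;
    }

    public static void main(String[] args) {
        // A tiny dummy problem with 10 empty rows, just to show the index bookkeeping.
        svm_problem prob = new svm_problem();
        prob.l = 10;
        prob.y = new double[prob.l];
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) prob.x[i] = new svm_node[0];

        int[][] folds = makeFolds(prob.l, 5, 42L);
        for (int f = 0; f < 5; f++) {
            svm_problem train = subProblem(prob, trainingIndices(folds, f));
            svm_problem test  = subProblem(prob, folds[f]);
            // Here you would call svm.svm_train(train, param) for each candidate C,
            // then score the resulting model on `test` with svm.svm_predict.
            System.out.println("fold " + f + ": train=" + train.l + " test=" + test.l);
        }
    }
}
```

Repeating the 5-fold loop for each candidate C (e.g. powers of 2, as the practical guide suggests) and averaging the 5 test errors gives the cross-validation estimate used to pick C.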