I am attempting to implement my own cross-validation function for LIBSVM; however, I am confused about how to process the data structures that are built from my input data.
The data is stored in the structure svm_problem:
public class svm_problem implements java.io.Serializable
{
public int l;
public double[] y;
public svm_node[][] x;
}
Where: l is the number of training examples; y is the array containing their target values; and x is an array of pointers, each of which points to the representation of one training vector.
svm_node is defined as:
public class svm_node implements java.io.Serializable
{
public int index;
public double value;
}
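For example (with made-up values), the 3-dimensional vector (0.5, 0.0, 1.2) would be stored as two svm_node entries, since LIBSVM uses a sparse representation with 1-based, ascending indices and omits zero-valued features:

// Sparse representation of the vector (0.5, 0.0, 1.2).
svm_node[] vector = new svm_node[2];

vector[0] = new svm_node();
vector[0].index = 1;   // feature 1
vector[0].value = 0.5;

vector[1] = new svm_node();
vector[1].index = 3;   // feature 3; the zero-valued feature 2 is omitted
vector[1].value = 1.2;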
My goal is to split the training data into 5 folds, use 4 of them for training (with svm_train), and use the remaining one to test the result (with svm_predict) in order to find which value of C gives the best prediction result (based on an error function).
My problem is how to separate the data into 5 folds given these structures. How can they be properly divided so that I can proceed with the optimization of C?
I have been using this as a guide: A Practical Guide to Support Vector Classification
If someone could provide an example or a link to an example on how this is best done it would be greatly appreciated. Thanks.
The svm_problem describes, for i = 0, 1, ..., l - 1, that f(x[i]) should approximately equal y[i] for the learned function f. Each tuple (x[i], y[i]) can be thought of as a noisy sample from the function f that you are trying to find.

To split your dataset into training, cross-validation, and testing datasets, you can simply split the set {0, 1, ..., l - 1} randomly into those 3 parts. This is typically done by shuffling the list of numbers 0, 1, ..., l - 1, then saying "the first 60% of the numbers are training, the next 20% are cross validation, the next 20% are testing," or something similar. For each of those subsets of the dataset, you can construct a new svm_problem that describes just that portion of the data.
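Here is a minimal sketch of that idea, adapted to the 5-fold setup from the question. It assumes the standard libsvm Java API (svm.svm_train, svm.svm_predict, svm_parameter); the helpers makeFolds and subproblem, the fixed seed, and the choice of candidate C values are mine, for illustration only:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import libsvm.*;

public class CrossValidateC {

    // Shuffle the indices 0..l-1 and cut them into k roughly equal folds.
    static List<List<Integer>> makeFolds(int l, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < l; i++) indices.add(i);
        Collections.shuffle(indices, new java.util.Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < l; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    // Build a new svm_problem containing only the given indices of prob.
    // The svm_node[] rows can be shared; no deep copy is needed.
    static svm_problem subproblem(svm_problem prob, List<Integer> idx) {
        svm_problem sub = new svm_problem();
        sub.l = idx.size();
        sub.x = new svm_node[sub.l][];
        sub.y = new double[sub.l];
        for (int i = 0; i < sub.l; i++) {
            sub.x[i] = prob.x[idx.get(i)];
            sub.y[i] = prob.y[idx.get(i)];
        }
        return sub;
    }

    // For each candidate C, total the misclassifications over the 5 folds
    // and keep the C with the lowest error.
    static double bestC(svm_problem prob, svm_parameter param, double[] candidates) {
        List<List<Integer>> folds = makeFolds(prob.l, 5, 42L);
        double bestC = candidates[0], bestErr = Double.MAX_VALUE;

        for (double c : candidates) {
            param.C = c;
            double err = 0;
            for (int f = 0; f < 5; f++) {
                List<Integer> testIdx = folds.get(f);
                List<Integer> trainIdx = new ArrayList<>();
                for (int g = 0; g < 5; g++)
                    if (g != f) trainIdx.addAll(folds.get(g));

                svm_model model = svm.svm_train(subproblem(prob, trainIdx), param);
                svm_problem test = subproblem(prob, testIdx);
                for (int i = 0; i < test.l; i++)
                    if (svm.svm_predict(model, test.x[i]) != test.y[i])
                        err++;   // count misclassifications
            }
            if (err < bestErr) { bestErr = err; bestC = c; }
        }
        return bestC;
    }
}

Since the svm_node[] rows are only referenced, not copied, building the per-fold svm_problem objects is cheap. The caller is expected to configure the rest of svm_parameter (kernel type, gamma, etc.) before calling bestC. For classification you may also want stratified folds (each fold preserving the class proportions), but plain shuffling as described above is a reasonable start; you can also compare your results against libsvm's built-in svm.svm_cross_validation.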