I have a training data set in which each line is composed of 14 integers separated by blanks. Each number is either 1 or 2. The i-th number indicates the presence of the i-th feature: 1 means false (absent) and 2 means true (present). The training data set looks like this:
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 2 1 1 1 1 1 1 1
1 2 1 2 1 1 2 1 1 1 1 1 1 1
1 2 1 2 1 1 2 1 1 1 1 1 1 1
1 2 1 2 1 1 2 1 1 1 1 1 1 1
1 2 1 2 1 1 2 1 1 1 1 1 1 1
The test data set contains 10000 lines, each representing a sample in which one value is missing. The missing value is represented by a zero, and there is exactly one zero per line. The test data look like this:
1 1 1 1 1 1 1 1 1 1 1 1 0 1
0 2 1 2 1 1 2 1 1 1 1 1 1 1
1 2 1 0 1 1 2 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 1 1
2 2 2 0 1 1 2 1 1 1 1 1 1 1
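To make the format concrete, here is a minimal sketch of how the test lines could be parsed and the position of the missing value located. It uses only the five sample lines shown above; the names `parse_rows` and `missing_index` are made up for illustration.

```python
def parse_rows(text):
    """Turn whitespace-separated lines into lists of ints."""
    return [[int(tok) for tok in line.split()]
            for line in text.strip().splitlines()]

# The five sample test lines from above (the real file has 10000).
test_text = """\
1 1 1 1 1 1 1 1 1 1 1 1 0 1
0 2 1 2 1 1 2 1 1 1 1 1 1 1
1 2 1 0 1 1 2 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 1 1
2 2 2 0 1 1 2 1 1 1 1 1 1 1
"""

test_rows = parse_rows(test_text)

# 0-based column of the single zero (the missing feature) in each row.
missing_index = [row.index(0) for row in test_rows]
print(missing_index)  # [12, 0, 3, 6, 3]
```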
I am very new to machine learning, and I would like to know a way to predict those missing values. I know that scikit-learn has a class called Imputer that fills in missing values, but as far as I can tell it does not use any training data. It would be great if someone could give me some pointers on how to tackle this problem.
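To illustrate the kind of behaviour being asked for, here is a rough sketch of one possible direction: a nearest-neighbour vote, where the missing feature is copied from the training rows that agree best with the test row on its 13 known features. This is an assumption about what might work, not a known-correct method, and it uses only the seven training lines and five test lines shown above. The function name `impute_row` is hypothetical.

```python
from collections import Counter

# Sample rows from the question (the real files are larger).
train = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
]
test = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1],
]

def impute_row(row, train_rows):
    """Fill the single 0 in `row` by majority vote among the closest training rows."""
    j = row.index(0)  # column of the missing feature

    def distance(t):
        # Hamming distance over the 13 known features (column j is skipped).
        return sum(1 for k, (a, b) in enumerate(zip(row, t)) if k != j and a != b)

    distances = [distance(t) for t in train_rows]
    best = min(distances)
    # Vote among all training rows tied for the smallest distance.
    votes = Counter(t[j] for t, d in zip(train_rows, distances) if d == best)
    filled = list(row)
    filled[j] = votes.most_common(1)[0][0]
    return filled

filled = [impute_row(r, train) for r in test]
# Predicted value for the missing feature of each test row.
print([f[r.index(0)] for f, r in zip(filled, test)])  # [1, 1, 2, 1, 2]
```

With thousands of rows this brute-force search compares every test row against every training row, so a classifier trained per feature might scale better; that choice is left open here.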