For a data mining competition I am building a model for churn prediction. I have a training dataset with labels and a test dataset without. To build my model, I applied some filters to preprocess the training dataset. I searched and removed the outliers and extreme values using the InterquartileRange, RemoveWithValues and RemoveAttributes filters (because InterquartileRange creates new attributes for outliers and extreme values).
I know that Weka requires the supplied test set and the training set to have the same filters applied, but I need all instances from my test set in order to see their predicted scores. Therefore, I cannot apply the RemoveWithValues filter. Because of this, I get the error "Test and training set are not compatible". Can this problem be solved? In summary, I want to get scores for all instances of my test set with a model built on a training set without extreme values and outliers.
You seem to misunderstand the requirements for training and test sets. You don't need to apply the same filters for test and training set, at least in the sense that you seem to think. However, you must apply the same transformation.
Training set and test set must be compatible, i.e., they must have the same features with the same names and the same types. (Theoretically, it would be possible for the test set to have more features, but I don't know how Weka handles this.) Let's call this syntactically compatible.
This can usually be accomplished by applying the same filters, but it doesn't have to be. For example, if you apply a filter that removes instances to the training set, then the "format" of the dataset is not changed and you don't need to apply that on the test set, too.
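To make "syntactically compatible" concrete, here is a minimal sketch in plain Java (no Weka dependency). It models a dataset header as an ordered map from attribute name to type, which is an assumption made for illustration, not Weka's actual data structure. The attribute names foo, churn, Outlier and ExtremeValue are likewise only illustrative. It shows that removing rows leaves the header untouched, while a filter that appends attributes to only one of the two datasets breaks compatibility.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderCheck {

    /**
     * Two headers count as compatible here when they list the same
     * attribute names with the same types in the same order.
     */
    public static boolean sameHeader(Map<String, String> a, Map<String, String> b) {
        // Comparing the entry lists (rather than the maps) makes order matter.
        return new ArrayList<>(a.entrySet()).equals(new ArrayList<>(b.entrySet()));
    }

    public static void main(String[] args) {
        // Hypothetical header of the untouched test set.
        Map<String, String> test = new LinkedHashMap<>();
        test.put("foo", "numeric");
        test.put("churn", "nominal");

        // A filter that only removes instances does not change the header.
        Map<String, String> trainRowsRemoved = new LinkedHashMap<>(test);

        // A filter that appends indicator attributes does change it.
        Map<String, String> trainWithFlags = new LinkedHashMap<>(test);
        trainWithFlags.put("Outlier", "nominal");
        trainWithFlags.put("ExtremeValue", "nominal");

        System.out.println(sameHeader(trainRowsRemoved, test)); // true
        System.out.println(sameHeader(trainWithFlags, test));   // false
    }
}
```

In this picture, the incompatibility error would come from a leftover difference in the headers, not from the number of instances in either set.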
However, applying the same filter means you have to train the filter on the training set and then apply it to the test set. Otherwise you can end up with two datasets that are syntactically compatible (and Weka won't complain) but not semantically compatible. For example, assume you have a training and a test dataset with a numerical feature foo. Suppose you apply Normalize to the training set (scale to range [0,1]). If you then apply a separate Normalize filter to the test set, it would then be: 0.0, 0.4, 1.0, scaled by the test set's own minimum and maximum instead of the training set's.

So in your case, you must have done something that changed the format of the test and the training set differently. (If they are not too long, you could post them in your question.)
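The normalization pitfall described above can be sketched in plain Java, without Weka. The raw values below are hypothetical (they are not taken from the original example); they are chosen so that normalizing the test set on its own yields 0.0, 0.4, 1.0. The helpers fitRange and scale are illustrative names, not Weka API.

```java
import java.util.Arrays;

public class NormalizeDemo {

    /** Learns the {min, max} range from the given values ("training" the filter). */
    public static double[] fitRange(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] {min, max};
    }

    /** Applies min-max scaling using a previously learned range. */
    public static double[] scale(double[] values, double min, double max) {
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature has no spread; map it to 0 to avoid dividing by zero.
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical raw values of the feature foo.
        double[] train = {0.0, 50.0, 100.0};
        double[] test = {10.0, 30.0, 60.0};

        // Semantically correct: range learned on the training set, applied to the test set.
        double[] trainRange = fitRange(train);
        double[] testWithTrainRange = scale(test, trainRange[0], trainRange[1]);

        // Semantically wrong: the test set is normalized with its own range.
        double[] testRange = fitRange(test);
        double[] testWithOwnRange = scale(test, testRange[0], testRange[1]);

        System.out.println(Arrays.toString(testWithTrainRange));
        System.out.println(Arrays.toString(testWithOwnRange));
    }
}
```

Both results have the same format, so nothing would flag a mismatch, but only the first one keeps a raw value of 10 meaning the same thing in both datasets.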
Note: I was initially confused by the mention of "extreme values", but it turns out it was just a lack of knowledge on my side. "Extreme values" are also instances in Weka-speak, so no problem there. They seem to be data points that are not outliers, but so extreme that they have too much influence on the learned model and its generalization. (Source)
I'll leave this here just for the sake of information.