Weka classification with removed instances in the training set

Question


526 Views Asked by LanderB At 17 August 2025 at 03:28

For a data mining competition I am building a model for churn prediction. I have a training dataset with labels and a test dataset without. To build my model, I applied some filters to preprocess the training dataset. I found and removed the outliers and extreme values using the InterquartileRange, RemoveWithValues and RemoveAttributes filters (the last one because InterquartileRange creates new attributes that flag outliers and extreme values).

I know that Weka requires the supplied test set and the training set to go through the same filters, but I need to keep all instances of my test set so that I can see the predicted score for each of them. Therefore, I cannot apply the RemoveWithValues filter to the test set. Because of this, I get the error "Test and training set are not compatible". Can this problem be solved? In summary, I want to get scores for all instances of my test set with a model built on a training set without extreme values and outliers.


**Sentry** · Answer 1

You seem to misunderstand the requirements for training and test sets. You don't need to apply the same filters to the test and training sets, at least not in the sense that you seem to think. However, you must apply the same transformation.

The training set and the test set must be compatible, i.e., they must have the same features with the same names and the same types. (Theoretically, it would be possible for the test set to have more features, but I don't know how Weka handles this.) Let's call this syntactically compatible.

This can usually be accomplished by applying the same filters, but it doesn't have to be. For example, if you apply a filter that removes instances to the training set, then the "format" of the dataset is not changed and you don't need to apply that on the test set, too.
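To make the point about instance-removing filters concrete, here is a minimal plain-Java sketch. It does not use Weka; the class `TrainOnlyRowFilter`, its methods, the fence factor of 1.5 and the sample data are all hypothetical and only stand in for the InterquartileRange/RemoveWithValues step. It drops outlier rows from the training data only and shows that the number of columns, and therefore the "format", stays the same as in the untouched test data.

```java
import java.util.Arrays;

// Hypothetical illustration, not Weka code: removing rows leaves the columns unchanged.
class TrainOnlyRowFilter {

    // Quantile by linear interpolation on a sorted copy (one of several common conventions).
    static double quantile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Keep only rows whose value in column `col` lies within [Q1 - k*IQR, Q3 + k*IQR].
    static double[][] dropOutlierRows(double[][] rows, int col, double k) {
        double[] column = Arrays.stream(rows).mapToDouble(r -> r[col]).toArray();
        double q1 = quantile(column, 0.25);
        double q3 = quantile(column, 0.75);
        double iqr = q3 - q1;
        double low = q1 - k * iqr;
        double high = q3 + k * iqr;
        return Arrays.stream(rows)
                .filter(r -> r[col] >= low && r[col] <= high)
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        // Made-up data: two numeric features per row.
        double[][] train = { {1, 10}, {2, 11}, {3, 12}, {4, 13}, {100, 14} };
        double[][] test = { {2, 10}, {250, 12} };

        double[][] cleanedTrain = dropOutlierRows(train, 0, 1.5);

        // Fewer training rows, but the same number of columns as the test set.
        System.out.println("train rows: " + train.length + " -> " + cleanedTrain.length);
        System.out.println("train columns: " + cleanedTrain[0].length
                + ", test columns: " + test[0].length);
        // The test set is left alone, so every test row can still be scored.
        System.out.println("test rows to score: " + test.length);
    }
}
```

The test rows are never passed through `dropOutlierRows`, so even the extreme test value 250 stays in the set and would still receive a prediction.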

However, applying the same filter means you have to train the filter on the training set and then apply it on the test set, otherwise you can end up with two datasets that are syntactically compatible (and Weka won't complain), but are not semantically compatible. For example, assume you have a training and a test dataset with a numerical feature foo:

- The training set has the values 0, 2, 5, 10.
- The test set has the values 0, 2, 5.
- You apply Normalize to the training set (scale to the range [0,1]).
- The filtered training set now has the values 0.0, 0.2, 0.5, 1.0.
- Applying the same "trained" filter to the test set gives 0.0, 0.2, 0.5.
- If you had applied a new Normalize filter to the test set instead, it would give 0.0, 0.4, 1.0.
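The arithmetic of this example can be reproduced with a small plain-Java sketch. It is not Weka's Normalize implementation; `MinMaxScaler`, `fit` and `transform` are hypothetical names chosen for illustration. The scaler learns the minimum and maximum from whatever data it is fitted on and then rescales any data with that stored range.

```java
import java.util.Arrays;

// Hypothetical illustration, not Weka's Normalize filter.
class MinMaxScaler {
    private double min;
    private double max;

    // "Train" the scaler: remember the value range of the given data.
    void fit(double[] values) {
        min = Arrays.stream(values).min().getAsDouble();
        max = Arrays.stream(values).max().getAsDouble();
    }

    // Rescale with the stored range; values outside it fall outside [0,1].
    double[] transform(double[] values) {
        double range = max - min;
        return Arrays.stream(values)
                .map(v -> range == 0.0 ? 0.0 : (v - min) / range)
                .toArray();
    }

    public static void main(String[] args) {
        double[] train = {0, 2, 5, 10};
        double[] test = {0, 2, 5};

        // Semantically compatible: fit on the training set, reuse on the test set.
        MinMaxScaler trained = new MinMaxScaler();
        trained.fit(train);
        System.out.println("train:              " + Arrays.toString(trained.transform(train)));
        System.out.println("test, same scaler:  " + Arrays.toString(trained.transform(test)));

        // Only syntactically compatible: a fresh scaler fitted on the test set itself.
        MinMaxScaler fresh = new MinMaxScaler();
        fresh.fit(test);
        System.out.println("test, fresh scaler: " + Arrays.toString(fresh.transform(test)));
    }
}
```

In both cases the test set keeps the same single numeric feature, which is why Weka would not complain, but only the first variant maps the value 5 to the same number in both datasets.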

So in your case, you must've done something that changed the format of the test and the training set differently. (If they are not too long, you could post them in your question.)

Note: I was confused by

> I searched and removed the outliers and extreme values

but it turns out it was just a lack of knowledge on my side. "Extreme values" are also instances in Weka-speak, so no problem there. They seem to be data points that are not outliers, but so extreme that they are too influential on the learned model and generalization. (Source)

I'll leave this here just for the sake of information.
