sigmoid output for detection class returns incorrect performance

Summary of my problem: I have a detection problem (binary classification on a very unbalanced data set) and I use a sigmoid output to classify samples. The reported f-score, precision and recall seem to consider both classes, i.e. the true positives appear to be the total number of correctly classified samples, not the number of samples belonging to class '1' that are correctly classified.

Longer explanation: In my experiment I have demographic data about persons and I have to predict whether or not they bought a product. I used PCA to reduce the initial features to just 4, and the data is stored in a CSV file (the first column holds the class labels, '0' and '1'). Note that most people didn't buy, so the two classes are very unbalanced. I use the CSVDataset class to read it:

dataset: &train !obj:pylearn2.datasets.csv_dataset.CSVDataset {
        path: 'input.csv',
        task: 'classification'
}
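
For reference, the file is assumed to look roughly like this (class label first, then the four PCA features; the values are made up for illustration, and depending on the CSVDataset options a header row may also be expected):

1,0.12,-1.30,0.45,2.01
0,-0.77,0.10,1.22,-0.34
0,0.05,0.88,-0.61,0.49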

I want to start with a simple classification model and use the f-score as the performance measure. My first idea was therefore to use an MLP model with a single sigmoid layer (the default 'detection' monitor provides recall, precision and f-score):

model: !obj:pylearn2.models.mlp.MLP {
        layers: [
                 !obj:pylearn2.models.mlp.Sigmoid {
                     layer_name: 'y',
                     dim: 2,
                     irange: .005
                 }
                ],
        nvis: 4,
    }

My initial idea was to set dim to 1 (the decision rule would be: if the output is > 0.5 choose class '1', if < 0.5 choose class '0'). However, I got the error ValueError: Can't convert to VectorSpace of dim 1. Expected either dim=2 (merged one-hots) or 2 (concatenated one-hots), so I decided to set dim to 2 (the decision rule would then be: if out1 > out0 choose '1', if out1 < out0 choose '0').
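
For clarity, the two-output decision rule is just an argmax over the two sigmoid outputs; a minimal sketch, assuming the network returns one row of two outputs per sample with column 1 corresponding to class '1':

import numpy as np

# outputs: one row per sample; column 0 = class '0', column 1 = class '1' (assumed layout)
outputs = np.array([[0.8, 0.3],
                    [0.2, 0.7]])
predictions = np.argmax(outputs, axis=1)  # pick the class with the larger output
print(predictions)                        # [0 1]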

In my train.yaml I more or less follow the softmax example notebook provided in the documentation. For example, I use the BGD algorithm and set batch_size to the total number of examples in the training set (74164 examples, a small dataset!) just to avoid confusion when checking the performance manually.

The model was trained with the provided train.py script and everything seemed fine, until I had a look at the results. As mentioned earlier, this is a detection problem where the class to detect ('1') occurs very rarely. I was therefore very surprised to see high values for the reported train_y_f1 (the best result is approx. 94%, after one epoch).

To check this, I computed the f-score manually using the provided predict_csv.py script and then loading the predictions. I saw that in fact there were only misses (all '1' samples were classified as '0'), so precision, recall and f-score should all be zero. Why does the detection monitor report higher values?
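
This is roughly the kind of manual check I mean; a sketch assuming y_true and y_pred are 0/1 numpy arrays built from the CSV labels and the saved predictions (names are placeholders):

import numpy as np

def detection_scores(y_true, y_pred):
    # count with respect to the positive class '1' only
    tp = np.sum((y_true == 1) & (y_pred == 1))  # hits
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
    fn = np.sum((y_true == 1) & (y_pred == 0))  # misses
    precision = float(tp) / (tp + fp) if tp + fp > 0 else 0.0
    recall = float(tp) / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1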

After some investigation, I found that the MLP has an output for each class, and I verified (I computed it manually and got the same numbers) that the true positives and false positives defined in get_detection_channels_from_state() actually refer to both classes, '1' and '0'. That is, the true positives are the number of vectors belonging to '1' that are classified as '1', plus the number of vectors belonging to '0' that are classified as '0'. So the MLP is classifying everything as '0', and since nearly all vectors belong to '0', the performance looks good. This is a well-known issue with unbalanced detection problems, where the raw classification rate is not a suitable measure; it is the very reason we have measures such as the f-score or AUC. However, if tp and fp in get_detection_channels_from_state() consider both classes, then the reported f-score is not useful (not to me, at least).
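
To make the discrepancy concrete, here is a small toy illustration (my own example, not pylearn2 code) of the two ways of counting true positives when a classifier predicts '0' for everything:

import numpy as np

y_true = np.array([0] * 95 + [1] * 5)  # heavily unbalanced toy labels
y_pred = np.zeros(100, dtype=int)      # a classifier that always answers '0'

tp_both_classes = np.sum(y_true == y_pred)            # 95: counts correct '0's and '1's
tp_class_one = np.sum((y_true == 1) & (y_pred == 1))  # 0: counts class-'1' hits only

# Metrics built from the first count look excellent;
# metrics built from the second are all zero.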

I can imagine that this is known to the designer of the Sigmoid class, so I can only assume that I am doing something wrong. Hopefully somebody can give me a hint :)

Note: I have submitted this question to the pylearn2 user mailing list. If I get an answer I will copy it here...

There is 1 best solution below

The pylearn2 monitor calculates the f1 score, % misclass, etc. for each batch, not for the entire epoch. When it generates the report, the reported f1 score is the mean of the per-batch f1 scores over the epoch. Reporting the mean over all the batches works just fine for quantities like misclass:

misclass[n] is the score for the nth batch
misclass_epoch = mean(misclass[0], misclass[1], ..., misclass[n])

However, you cannot make the same statement for the f1 score:
f1_epoch != mean(f1[0], f1[1], ..., f1[n])
where f1[n] = 2 * precision[n] * recall[n] / (precision[n] + recall[n])
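
A toy illustration of the difference (the per-batch counts are made up purely for this example):

import numpy as np

def f1(tp, fp, fn):
    p = float(tp) / (tp + fp) if tp + fp else 0.0
    r = float(tp) / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

batches = [(1, 0, 9), (9, 0, 1)]  # per-batch (tp, fp, fn) counts

mean_of_batch_f1 = np.mean([f1(*b) for b in batches])  # ~0.56
epoch_f1 = f1(*[sum(c) for c in zip(*batches)])        # ~0.67 (tp=10, fp=0, fn=10)
# The mean of the per-batch f1 scores does not equal the f1 computed over the whole epoch.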

For demonstration purposes, try setting the batch size to be the size of the data set (you can get away with this in the mnist example). The f1 score will then be correct.

So the best advice is to keep an eye on quantities in the monitor like misclass, where the mean over the batches equals the value for the epoch. Once you've trained the network, you can make predictions for your entire validation set and calculate the f1 score at that point.