I have the model below, which takes a 224x224x3 image as input and classifies it into one of two categories:
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
rescaling_3 (Rescaling) (None, 224, 224, 3) 0
mobilenetv2_1.00_224 (Funct (None, 7, 7, 1280) 2257984
ional)
spatial_dropout2d_12 (Spati (None, 7, 7, 1280) 0
alDropout2D)
conv2d_9 (Conv2D) (None, 7, 7, 2048) 2623488
spatial_dropout2d_13 (Spati (None, 7, 7, 2048) 0
alDropout2D)
conv2d_10 (Conv2D) (None, 7, 7, 1024) 2098176
spatial_dropout2d_14 (Spati (None, 7, 7, 1024) 0
alDropout2D)
conv2d_11 (Conv2D) (None, 7, 7, 256) 262400
spatial_dropout2d_15 (Spati (None, 7, 7, 256) 0
alDropout2D)
flatten_3 (Flatten) (None, 12544) 0
dropout_3 (Dropout) (None, 12544) 0
dense_3 (Dense) (None, 2) 25090
=================================================================
Total params: 7,267,138
Trainable params: 5,009,154
Non-trainable params: 2,257,984
_________________________________________________________________
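For reference, a definition that reproduces this summary looks roughly like the sketch below. The 1x1 kernels follow from the parameter counts in the summary; the dropout rates, activations, and rescaling constants are placeholders, not necessarily the exact values I used:

from tensorflow.keras import Sequential
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (Rescaling, SpatialDropout2D, Conv2D,
                                     Flatten, Dropout, Dense)

# Frozen MobileNetV2 base: matches the 2,257,984 non-trainable params
base = MobileNetV2(input_shape=(224, 224, 3), include_top=False,
                   weights='imagenet')
base.trainable = False

model = Sequential([
    Rescaling(1./127.5, offset=-1, input_shape=(224, 224, 3)),  # scaling is a guess
    base,
    SpatialDropout2D(0.2),               # all dropout rates are assumptions
    Conv2D(2048, 1, activation='relu'),  # 1x1 kernel: 1280*2048 + 2048 = 2,623,488 params
    SpatialDropout2D(0.2),
    Conv2D(1024, 1, activation='relu'),
    SpatialDropout2D(0.2),
    Conv2D(256, 1, activation='relu'),
    SpatialDropout2D(0.2),
    Flatten(),                           # 7*7*256 = 12,544 features
    Dropout(0.5),
    Dense(2, activation='softmax'),      # 12,544*2 + 2 = 25,090 params
])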
And here are the compile parameters:
from tensorflow.keras.optimizers import RMSprop

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),  # 'lr' is the deprecated spelling
              metrics=['categorical_accuracy'])
When I train it for any number of epochs, I get a "categorical accuracy" of around 70% (chance would be 50%). This is the number reported by model.evaluate(test_set).
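Concretely, the numbers come from calls like these (train_set and the epoch count here are placeholders):

model.fit(train_set, epochs=10)
loss, acc = model.evaluate(test_set)
print(acc)  # prints roughly 0.70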
However, when I actually compare the argmax() of each prediction from model.predict(test_set) to the test_set.labels list of correct labels, they agree in only about 50% of cases, i.e. no better than chance.
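The manual comparison looks like this (a sketch; test_set is assumed to be an ImageDataGenerator.flow_from_directory iterator, which is what provides the .labels attribute):

import numpy as np

preds = model.predict(test_set)          # shape (num_samples, 2)
pred_classes = np.argmax(preds, axis=1)  # predicted class index per sample
manual_acc = np.mean(pred_classes == np.asarray(test_set.labels))
print(manual_acc)  # prints roughly 0.50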
This happens consistently. I could accept that the problem is simply too hard to classify, but where does the 70%/50% discrepancy come from?
I expected the manually computed test-set accuracy to match the output of model.evaluate(test_set), but it is much worse.