Dealing with over-fitting: data enlargement, cross-validation, rotation-augmentation


Currently, I am just exploring the nets provided by tflearn (VGG Net, GoogLeNet, ResNet, etc.) and applying them to my dataset (128×128 single-channel images; 925 images before augmentation, 5058 after augmentation; two classes: cancerous and non-cancerous).

  1. Problem: a large discrepancy between training accuracy (~100%) and validation accuracy (~70%).

  2. My approach: 1) reducing model complexity by reducing the number of convolutional kernels, 2) reducing the number of nodes in the fully connected (FC) layer, 3) increasing the dropout rate at the FC layer (a rough sketch of such a scaled-down network is given after the questions below).

  3. Question:

1) Could this over-fitting problem have been caused, at least to some degree, by an insufficient amount of training data? I think that with much more training data, the training set would better represent the underlying distribution (from which the validation set is also drawn), so the validation accuracy would be close to the training accuracy.

2) Would cross-validation help reduce the discrepancy? Still, if I hold out a test set that is never used for training, I think the test accuracy will continue to differ considerably from the training accuracy. Is that correct?

3) As far as I know, shift augmentation wouldn't provide new information, since convolution is shift-invariant. What about rotation? (I rotate before cropping the ROI, so the image doesn't contain zeros at the boundary.)
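
To make the approach in 2. concrete, here is a rough tflearn sketch of what I mean by a scaled-down network; the kernel counts, FC width, and keep probability are just illustrative, not the exact values I use:

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# 128x128 single-channel input, as described above.
net = input_data(shape=[None, 128, 128, 1])
# Fewer convolutional kernels than the stock architectures.
net = conv_2d(net, 16, 3, activation='relu')
net = max_pool_2d(net, 2)
net = conv_2d(net, 32, 3, activation='relu')
net = max_pool_2d(net, 2)
# Smaller FC layer plus aggressive dropout (tflearn's dropout takes a keep probability).
net = fully_connected(net, 64, activation='relu')
net = dropout(net, 0.5)
net = fully_connected(net, 2, activation='softmax')
net = regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net)
```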

Thanks!! :D


There are 2 best solutions below

  1. Yes
  2. No, not if you don't change the size of your training dataset. However, cross-validation is often used to make more of your data available for training.
  3. Rotation will only help if rotated versions actually occur in your data. For example, a 180° rotation might actually do harm.

Good augmentations for standard images can be found in the TensorFlow CIFAR-10 example:

  • tf.random_crop(reshaped_image, [height, width, 3])
  • tf.image.random_flip_left_right(distorted_image)
  • tf.image.random_brightness(distorted_image, max_delta=63)
  • tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
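
A minimal sketch of how these ops could be chained into a per-image distortion function (assuming TF 1.x; the crop size and the single channel are illustrative choices for a 128×128 grayscale image):

```python
import tensorflow as tf

def distort(image, height=120, width=120):
    # Randomly crop a sub-window of the (single-channel) image.
    distorted = tf.random_crop(image, [height, width, 1])
    # Randomly mirror the image horizontally.
    distorted = tf.image.random_flip_left_right(distorted)
    # Randomly perturb brightness and contrast.
    distorted = tf.image.random_brightness(distorted, max_delta=63)
    distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)
    return distorted
```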

To fight overfitting, you might want to introduce regularization, especially dropout (tf.nn.dropout).
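
As a rough TF 1.x-style sketch (the layer widths and keep probability are illustrative; feed keep_prob=1.0 at evaluation time):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128 * 128])  # flattened input
keep_prob = tf.placeholder(tf.float32)              # e.g. 0.5 for training, 1.0 for eval

fc = tf.layers.dense(x, 256, activation=tf.nn.relu)  # fully connected layer
fc_drop = tf.nn.dropout(fc, keep_prob)               # dropout regularization
logits = tf.layers.dense(fc_drop, 2)                 # two classes
```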

However, it does not have to be overfitting. It could also be that the distribution of your test data is different from that of your training data (but overfitting is more likely).


My 2 cents:

  1. Cross-validation might help (or not). It depends.

The idea behind CV is to resample the limited available training data and average the model's accuracy over the folds. Imagine the case where CV is not applied and there is a huge outlier in the test split (one that never appears in the training split): the resulting accuracy estimate is skewed and can be misleading.
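
As a rough illustration, a stratified k-fold loop could look like the sketch below; train_model and evaluate_model are hypothetical stand-ins for your own tflearn training and evaluation routines, and the data here are random placeholders:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# X: flattened 128x128 images, y: 0/1 labels (random placeholders).
X = np.random.rand(100, 128 * 128)
y = np.random.randint(0, 2, size=100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, val_idx in skf.split(X, y):
    # Hypothetical helpers: fit on the training fold, score on the held-out fold.
    model = train_model(X[train_idx], y[train_idx])
    fold_accuracies.append(evaluate_model(model, X[val_idx], y[val_idx]))

print("mean CV accuracy:", np.mean(fold_accuracies))
```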