I am training an EfficientNet Lite (from scratch) on a dataset of ~10,000,000 images (128x128x1) with ~6,500 classes. My training loss is converging, as is my training accuracy. However, my test loss/accuracy keep fluctuating. When I test the CNN manually on some inputs, the results look very good and it recognizes (nearly) everything correctly. Because my GPU has only 8 GB of memory, I am training with batch size 256 and fp16 calculations.
Now my question is: why does the test loss/accuracy fluctuate so much, and is there something I can do to correct for that?
Here are some (possibly) important details:
Loading the dataset:
tr_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,
    validation_split=val_split,
    subset="training",
)
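The matching validation set is not shown in the post; presumably it is created with the same call but subset="validation" (a sketch; the same seed must be used, otherwise the two subsets overlap):

val_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,  # must match the training call so the split is consistent
    validation_split=val_split,
    subset="validation",
)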
My model (using the official TF implementation):
def instantiate_char_cnn(include_augmentation=False, name=NAME):
    eff_net_lite = EfficientNetLiteB0(
        include_top=True,
        weights=None,
        input_shape=(img_size[0], img_size[1], 1),
        classes=len(ls),
        pooling="avg",  # note: ignored when include_top=True
        classifier_activation="softmax",
    )
    if include_augmentation:
        model = tf.keras.Sequential(
            [
                tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
                PreprocessTFLayer(),
                img_augmentation,
                eff_net_lite,
            ],
            name=name,
        )
    else:
        model = tf.keras.Sequential(
            [
                tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
                PreprocessTFLayer(),
                eff_net_lite,
            ],
            name=name,
        )
    return model
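Since the question mentions fp16 training, here is a minimal sketch of how the model might be built and compiled under mixed precision (the policy setup, optimizer, and loss are assumptions; the original post does not show them):

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")  # assumed fp16 setup; must be set before the model is built

model = instantiate_char_cnn(include_augmentation=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="categorical_crossentropy",  # matches label_mode="categorical"
    metrics=["accuracy"],
)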
The custom layer for preprocessing:
@tf.function
def preprocess_tf(x):
    """
    Preprocessing for TF Lite.

    Args:
        x : a Tensor(batch_size, height, width, channels) of images to preprocess

    Return:
        normalized and resized Tensor of images
    """
    # resize images
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)
    # scale into [0, 1]; note this divides by the maximum over the *whole
    # batch*, so the scaling of each image depends on the other images in
    # its batch
    x = tf.math.divide(x, tf.math.reduce_max(x))
    return x
class PreprocessTFLayer(tf.keras.layers.Layer):
    def __init__(self, name="preprocess_tf", **kwargs):
        super(PreprocessTFLayer, self).__init__(name=name, **kwargs)
        self.preprocess = preprocess_tf

    def call(self, input):
        return self.preprocess(input)

    def get_config(self):
        config = super(PreprocessTFLayer, self).get_config()
        return config

    def get_prunable_weights(self):
        return []
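A quick way to see the batch dependence of this normalization (a hypothetical check with made-up constant images; img_size is assumed to be defined as in the question): the same image comes out scaled differently depending on what else is in its batch, which matters when training uses batches of 256 but manual testing feeds single images.

import numpy as np

img_size = (128, 128)  # assumed, as in the question
layer = PreprocessTFLayer()
a = np.full((1, 64, 64, 1), 100.0, dtype=np.float32)
b = np.full((1, 64, 64, 1), 200.0, dtype=np.float32)
print(layer(a).numpy().max())                          # 1.0 (divided by 100)
print(layer(np.concatenate([a, b])).numpy()[0].max())  # 0.5 (divided by 200)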
The Keras layers for image augmentation:

from tensorflow.keras.layers.experimental.preprocessing import Resizing, Rescaling, RandomZoom, RandomRotation, RandomTranslation
import RandomErasing  # custom/third-party module, not part of tf.keras (source not shown in the post)

img_augmentation = tf.keras.Sequential(
    [
        RandomErasing.RandomErasing(probability=0.4),
        # random data augmentation
        RandomZoom(height_factor=(-0.2, 1.0), width_factor=(-0.2, 1.0),
                   fill_mode="constant", interpolation="bilinear", fill_value=0.0),
        RandomTranslation(0.2, 0.2, fill_mode="constant"),
        RandomRotation(factor=(-0.1, 0.1), fill_mode="constant", interpolation="bilinear"),
    ],
    name="img_augmentation",
)
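These random layers are only active in training mode; at inference (e.g. in model.predict) they pass inputs through unchanged. A quick way to preview what the pipeline does to a batch (a sketch, assuming tr_dataset from above and matplotlib installed):

import matplotlib.pyplot as plt

images, labels = next(iter(tr_dataset))
augmented = img_augmentation(images, training=True)  # force training-mode behavior

plt.subplot(1, 2, 1)
plt.imshow(images[0, ..., 0], cmap="gray")
plt.title("original")
plt.subplot(1, 2, 2)
plt.imshow(augmented[0, ..., 0], cmap="gray")
plt.title("augmented")
plt.show()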
There could be many reasons behind this phenomenon, and human error may be involved. The key is how to troubleshoot systematically; manual inspection alone often does not give you useful hints.
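For example, one cheap check (a sketch, assuming a val_dataset built as above and a compiled model): evaluate the trained model at different batch sizes. If the metrics change with batch size, some step in the pipeline (such as the batch-wide max normalization shown above) makes predictions depend on batch composition.

# Evaluate with two different batch sizes; the numbers should match.
val_small = val_dataset.unbatch().batch(1)
val_large = val_dataset.unbatch().batch(256)
print(model.evaluate(val_small))
print(model.evaluate(val_large))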