While training a CNN (ResNet-50 on ImageNet), I'm observing a strange "staircase" shape (see the descending blue line in the plots linked below; sorry for the obscured legend text).
I use TensorFlow Datasets to load the data, and the data is shuffled. I use a batch size of 40 and 1000 batches (iterations) per "epoch", so each point on the plot represents 40,000 samples. The staircase steps down every 32 points, meaning each step spans 32 × 40,000 = 1,280,000 samples, which is almost exactly one full pass over ImageNet's training set (roughly 1.28 million images). I use the Adam optimizer with a fixed learning rate of 0.01.
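As a sanity check on the sample counts, here is a small sketch of how the plot points convert to samples. The variable names are just for illustration, and the ImageNet size is an assumption (the commonly cited ILSVRC 2012 training split size), not something I measured from my own pipeline.

```python
# Settings taken from the training setup described above.
batch_size = 40
batches_per_point = 1000   # iterations per "epoch", i.e. per plotted point
points_per_step = 32       # the staircase drops every 32 plotted points

# Assumption: ILSVRC 2012 training split size, not measured from my pipeline.
imagenet_train_size = 1_281_167

samples_per_point = batch_size * batches_per_point
samples_per_step = samples_per_point * points_per_step
passes_per_step = samples_per_step / imagenet_train_size

print(f"samples per plotted point: {samples_per_point:,}")
print(f"samples per staircase step: {samples_per_step:,}")
print(f"passes over the training set per step: {passes_per_step:.3f}")
```

If these numbers are right, each step corresponds to very nearly one real pass over the training data, which may or may not be a coincidence.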
Any guesses as to what may be causing this?