I'm trying to train a SimCLR network with a custom lightweight ConvNet backbone (I've already tried a ResNet as well) on a dataset built from the first five letters of the alphabet: in each image, two of the letters are randomly selected and placed at random positions. I'm unsure which augmentations to use in such a scenario, so I only use image translation to provide some degree of difference between the augmented samples.
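For concreteness, translation-only augmentation amounts to roughly the following. This is a simplified pure-Python sketch, not my actual pipeline: `translate`, `two_views`, and `max_shift` are illustrative names, and filling the vacated pixels with zeros is an assumption.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2D image (list of rows) by dx columns and dy rows.

    Pixels shifted outside the frame are dropped; vacated pixels get `fill`.
    """
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def two_views(image, max_shift=4, rng=random):
    """Return two independently translated views of one image.

    In SimCLR these two views form the positive pair for the contrastive loss.
    """
    def one_view():
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        return translate(image, dx, dy)

    return one_view(), one_view()
```

Note that with only this augmentation, the two views of an image differ solely by a shift of the whole frame.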
This sounds like an extremely trivial task, but a multi-label classifier built on top of the frozen pretrained network performs VERY poorly. I'm quite certain this is due to the poor quality of the learned representations rather than the linear classifier. A fully supervised classifier works well on the same data, as expected.
Variations I've tried so far:
- Made the dataset single-letter, random position (multi-class), and it performed very well.
- Made the dataset with random letters but the same center position, and it performed well. Both variations used the same translation augmentation mentioned above.
Sample image from the dataset (the label here is [1, 1, 0, 0, 0], marking which letters are present)
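To make the label format explicit, here is a small sketch of the multi-hot encoding. It assumes the five classes are the letters A to E in alphabetical order, so that [1, 1, 0, 0, 0] corresponds to A and B; `LETTERS`, `multi_hot`, and `decode` are illustrative names only.

```python
LETTERS = "ABCDE"  # assumed class order: index 0 is A, index 4 is E


def multi_hot(present):
    """Encode the set of letters present in an image as a 5-element 0/1 vector."""
    return [1 if letter in present else 0 for letter in LETTERS]


def decode(label):
    """Inverse of multi_hot: list the letters whose entry is 1."""
    return [letter for letter, flag in zip(LETTERS, label) if flag == 1]


label = multi_hot({"A", "B"})  # -> [1, 1, 0, 0, 0]
```

Each image in the two-letter dataset therefore has exactly two ones in its label, which is why the downstream classifier is multi-label rather than multi-class.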
Can someone please help me figure out how to make this work?
This is not the first time I've heard of someone trying SimCLR and getting horrible results...
I have some questions: