After reading this post, I sort of understand how the network transforms images; however, I cannot grasp how it actually LEARNS which orientation is helpful for a subsequent classification step.
Near the end of the post and of PyTorch's STN tutorial, they show how the STN rotates and translates images for better classification performance.
Is it solely based on the training set? For example, if a majority of images tend to have a certain orientation, say rotated by 20 deg, does the network learn to rotate the unrotated images accordingly?
The network will not learn to rotate images to 0 deg, because the STN has no notion of which orientation looks right (proper) to a human. The STN module decides the proper orientation based solely on whatever improves the network's overall accuracy.
So, yes, you are right. If a human annotates most of the images while they are rotated by 20 deg and declares those the correct ground truth, the STN should eventually generalize towards 20 deg, because that is the configuration in which the model reaches the minimum loss (or, in other words, maximum accuracy) on its objective function.
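To make this concrete, here is a minimal sketch of an STN-style classifier in PyTorch (the layer sizes and names like `STNClassifier`, `loc`, and `clf` are my own simplification, not the tutorial's exact code). The key point it illustrates: the localization network that predicts the affine transform receives gradients only through the downstream cross-entropy loss, so there is no separate "orientation label" anywhere; the transform is learned because it reduces classification loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Localization network: predicts the 6 parameters of a 2x3 affine matrix.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize the final layer to the identity transform, so training
        # starts from "do nothing" and only deviates if it lowers the loss.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Plain classifier applied to the transformed (warped) image.
        self.clf = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)               # predicted affine params
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)  # differentiable warp
        return self.clf(x)

model = STNClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 1, 28, 28)          # stand-in for a batch of MNIST digits
y = torch.randint(0, 10, (4,))
loss = F.cross_entropy(model(x), y)    # the ONLY training signal
loss.backward()                        # gradients flow back through grid_sample
opt.step()                             # ...into the localization net's weights
```

Because `grid_sample` is differentiable with respect to `theta`, the classification loss can push the predicted transform toward whatever orientation the classifier finds easiest, e.g. 20 deg if that is what the labeled data rewards.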