I'm working on a binary segmentation task using a Vision-Transformer-like network. However, the ViT produces the final mask from the last layer's output only, ignoring the middle layers' features. I wonder what each vision transformer layer actually does. Does each layer extract increasingly global features? If so, do the features simply keep getting better as the number of layers increases?
I tried to visualize some of the feature vectors, but it is hard to draw any intuitive conclusions from them. I would like to know whether there are papers that discuss this same problem, and what their findings are.
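For context on what I've tried so far: I collect the per-layer token features with forward hooks and then inspect them. Below is a minimal sketch of that setup, assuming a toy ViT-style encoder (`TinyViT` is a hypothetical stand-in, not my actual network); the same hook pattern applies to any model whose transformer blocks are accessible as submodules.

```python
import torch
import torch.nn as nn

# Hypothetical minimal ViT-style encoder: patch embedding + a stack of
# transformer blocks. A stand-in for illustration, not the real network.
class TinyViT(nn.Module):
    def __init__(self, dim=64, depth=6, heads=4, num_patches=196):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)  # 16x16 RGB patches
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, patches):
        x = self.patch_embed(patches) + self.pos
        for blk in self.blocks:
            x = blk(x)
        return x

features = {}

def make_hook(name):
    # Store each block's output tokens so they can be inspected later.
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

model = TinyViT().eval()
for i, blk in enumerate(model.blocks):
    blk.register_forward_hook(make_hook(f"block_{i}"))

with torch.no_grad():
    patches = torch.randn(1, 196, 16 * 16 * 3)  # one dummy image's patches
    _ = model(patches)

# `features` now holds one [1, 196, 64] tensor per layer; comparing their
# token norms or pairwise cosine similarities across depth is one way to
# see how the representations change from layer to layer.
```

With the per-layer tensors captured this way, one can plot, for example, the cosine similarity between each layer's tokens and the final layer's tokens, which gives a rough picture of how quickly the representation converges with depth.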