I was wondering how to visualize the cross-attention map over image features that a model attends to, given a text query (e.g. a sentence). There are some amazing explainability tools like Class Activation Maps, but they almost always require a 'class' and/or a CNN model (there are ViT attention maps too, but again for classification problems): pytorch-grad-cam, ViT attention maps with classes. But I can't enumerate classes for words, because each sentence is made up of different words. How can I visualize the cross-attention output of an encoder?
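For context, here is a minimal sketch of what I mean, assuming plain `torch.nn.MultiheadAttention` (all shapes, the 7x7 patch grid, and the feature names are made up for illustration): text tokens attend over image patches, and since there is no class, I just average the attention weights over the text tokens to get one value per patch.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 7x7 = 49 image patch embeddings (keys/values)
# and a 5-token text query, embedding dim 64. All shapes are made up.
torch.manual_seed(0)
dim, grid, n_tokens = 64, 7, 5
img_feats = torch.randn(1, grid * grid, dim)  # (batch, 49, dim)
txt_feats = torch.randn(1, n_tokens, dim)     # (batch, 5, dim)

# Cross-attention: text tokens (queries) attend over image patches.
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

# need_weights=True returns attention weights averaged over heads,
# with shape (batch, num_queries, num_keys) = (1, 5, 49).
_, weights = attn(txt_feats, img_feats, img_feats, need_weights=True)

# No fixed class: average over the text tokens, then reshape the
# per-patch scores back into the spatial patch grid.
heatmap = weights.mean(dim=1).reshape(grid, grid)  # (7, 7)
print(heatmap.shape)
```

From there I could upsample the heatmap to the image size (e.g. with `torch.nn.functional.interpolate`) and overlay it, but I'm not sure this token-averaging approach is the right way to do it.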
Any help would be appreciated.
Thank you. :)