How can I visualize cross-attention maps to check text-image alignment?


I was wondering how to visualize the cross-attention map that shows which image features a model attends to given a text query (e.g. a sentence). There are some great explainability tools like Class Activation Maps (e.g. pytorch-grad-cam), but they almost always require a 'class' or a CNN model (there are ViT attention maps too, but those are for classification problems). In my case there is no fixed set of classes, because each sentence is made up of different words. How can I visualize the output of a cross-attention encoder?
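To make the goal concrete, here is a minimal sketch of the kind of per-word map being asked about. It assumes the query vectors for the text tokens and the key vectors for the image patches can be extracted from the cross-attention layer; the function names `cross_attention_weights` and `token_heatmap` and the toy values are hypothetical, and a real model would use a tensor library and multiple heads rather than plain Python lists.

```python
import math

def cross_attention_weights(text_queries, image_keys):
    """Return softmax(Q K^T / sqrt(d)): one row of patch weights per text token."""
    d = len(text_queries[0])
    weights = []
    for q in text_queries:
        # Scaled dot product between this token and every image patch
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_keys]
        # Numerically stable softmax over the patches
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def token_heatmap(weights_row, grid_h, grid_w):
    """Reshape one token's patch weights into a grid_h x grid_w map in [0, 1]."""
    assert len(weights_row) == grid_h * grid_w
    lo, hi = min(weights_row), max(weights_row)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat map
    scaled = [(w - lo) / span for w in weights_row]
    return [scaled[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

# Toy data (hypothetical): 2 text tokens, a 2x2 grid of image patches, dim 2
tokens = ["cat", "grass"]
text_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 4.0]]

weights = cross_attention_weights(text_queries, image_keys)
# One heatmap per word, so no fixed class list is needed
heatmaps = {tok: token_heatmap(row, 2, 2) for tok, row in zip(tokens, weights)}
```

Each entry of `heatmaps` is a small grid that could be upsampled to the image size and overlaid on the image, for example with matplotlib's `imshow` and an `alpha` value. Because there is one map per token, the number of maps follows the sentence length instead of a predefined set of classes.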

Please help me.

Thank you. :)
