CLIP's Visual Transformer image encoder output


I was doing some experiments with the output of CLIP's Vision Transformer image encoder (clip-ViT-B-32). Since it is a semantic model, I expected it to produce almost the same feature vector for the same scene or image. However, it seems to be quite sensitive to illumination and lighting conditions, and the similarity between the two images below is much lower than I expected (it reports only 89.45% similar).

Why is that? Are there any ways/models that are less sensitive to illumination changes and more semantics-based?

from sentence_transformers import SentenceTransformer, util
#......
model = SentenceTransformer('clip-ViT-B-32')
# `image` is the list of PIL images loaded in the elided code above
encoded_image = model.encode(image, batch_size=128, convert_to_tensor=True, show_progress_bar=True)

# Compare every image embedding against all the others and return the pairs
# with the highest cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
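
For reference, the 89.45% figure is the cosine similarity of the top pair returned by paraphrase_mining_embeddings. A minimal sketch of checking that score directly with util.cos_sim (the file names here are placeholders for the two photos below):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Placeholder file names for the two photos of the same scene
img_a = Image.open('scene_bright.jpg')
img_b = Image.open('scene_dim.jpg')

emb = model.encode([img_a, img_b], convert_to_tensor=True)
# Cosine similarity between the two CLIP image embeddings (1.0 = identical direction)
print(util.cos_sim(emb[0], emb[1]).item())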

[Two photos of the same scene under different lighting conditions]
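
For what it's worth, here is a rough sketch of preprocessing that could remove the purely photometric part of the gap: normalizing the lighting with PIL's autocontrast before encoding. The file names are again placeholders, and this does not make CLIP illumination-invariant; it only evens out some exposure differences.

from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps

model = SentenceTransformer('clip-ViT-B-32')

def normalize_lighting(img):
    # Crude photometric normalization: stretch the intensity histogram so the
    # darkest/brightest pixels span the full range before CLIP's own preprocessing
    return ImageOps.autocontrast(img.convert('RGB'))

# Placeholder file names for the two photos of the same scene
paths = ['scene_bright.jpg', 'scene_dim.jpg']
images = [normalize_lighting(Image.open(p)) for p in paths]

emb = model.encode(images, convert_to_tensor=True)
print('similarity after autocontrast:', util.cos_sim(emb[0], emb[1]).item())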
