CLIP's Visual Transformer image encoder output


I was doing some experiments with the output of CLIP's Vision Transformer image encoder (clip-ViT-B-32). Since it is a semantic model, I expected it to produce almost the same feature vector for the same scene or image. However, it seems to be quite sensitive to illumination and lighting conditions, and the similarity between the two images below is much lower than I expected (it reports only 89.45% similar).

Why is that? Are there any ways/models that are less sensitive to illumination changes and more semantics-based?

from sentence_transformers import SentenceTransformer, util
#......
model = SentenceTransformer('clip-ViT-B-32')
# `image` is the list of PIL images loaded in the elided code above
encoded_image = model.encode(image, batch_size=128, convert_to_tensor=True, show_progress_bar=True)

# Compare every image embedding against all the others and return the pairs
# with the highest cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
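
For reference, the 89.45% figure is the cosine similarity of the top pair returned by paraphrase_mining_embeddings. A minimal sketch of checking that score directly with util.cos_sim (the file names here are placeholders for the two photos below):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Placeholder file names for the two photos of the same scene
img_a = Image.open('scene_bright.jpg')
img_b = Image.open('scene_dim.jpg')

emb = model.encode([img_a, img_b], convert_to_tensor=True)
# Cosine similarity between the two CLIP image embeddings (1.0 = identical direction)
print(util.cos_sim(emb[0], emb[1]).item())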

[Two photos of the same scene under different lighting conditions]
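
For what it's worth, here is a rough sketch of preprocessing that could remove the purely photometric part of the gap: normalizing the lighting with PIL's autocontrast before encoding. The file names are again placeholders, and this does not make CLIP illumination-invariant; it only evens out some exposure differences.

from sentence_transformers import SentenceTransformer, util
from PIL import Image, ImageOps

model = SentenceTransformer('clip-ViT-B-32')

def normalize_lighting(img):
    # Crude photometric normalization: stretch the intensity histogram so the
    # darkest/brightest pixels span the full range before CLIP's own preprocessing
    return ImageOps.autocontrast(img.convert('RGB'))

# Placeholder file names for the two photos of the same scene
paths = ['scene_bright.jpg', 'scene_dim.jpg']
images = [normalize_lighting(Image.open(p)) for p in paths]

emb = model.encode(images, convert_to_tensor=True)
print('similarity after autocontrast:', util.cos_sim(emb[0], emb[1]).item())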
