How to calculate similarity between an image and a text?


I have several images and want to know whether any of them contains an aircraft. I used CLIP as shown below, but the output is [[1.0]] even though the image shows human faces. I think this is because of the softmax. I tried using logits_per_image instead, but I don't know how to interpret the value: tensor([[20.03]]).

Is there a way to express how strongly an image is related to a word, for example as a percentage? Alternatively, could I use object detection to check whether my images contain an aircraft?

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open('image_4.jpg')
inputs = processor(text=['aircraft'], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
probs.tolist()
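
A possible reading of the two numbers above, as a minimal sketch rather than a definitive answer. Softmax normalizes scores across the candidate texts, so with a single label such as ['aircraft'] it has nothing to compare against and returns 1.0 for any image. Passing several candidate texts to the processor (for example text=['aircraft', 'a human face', 'a landscape']) makes the scores compete. The logits for the extra labels below are hypothetical, invented only for illustration; 20.03 is the value from the question. The logit scale of 100 is also an assumption: CLIP multiplies the cosine similarity by a learned scale, which should be readable from the model as model.logit_scale.exp() and is believed to be about 100 for this checkpoint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One candidate label: the single score is normalized against itself.
single = softmax([20.03])
print(single)  # [1.0]

# Several candidate labels: the scores now compete with each other.
# HYPOTHETICAL logits standing in for outputs.logits_per_image[0];
# only 20.03 comes from the question.
labels = ["aircraft", "a human face", "a landscape"]
hypothetical_logits = [20.03, 26.5, 18.2]
probs = softmax(hypothetical_logits)
best = labels[probs.index(max(probs))]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.1%}")

# ASSUMPTION: the learned logit scale is about 100 for this checkpoint,
# so dividing the raw logit by it approximates the cosine similarity.
ASSUMED_LOGIT_SCALE = 100.0
cosine = 20.03 / ASSUMED_LOGIT_SCALE
print(f"approximate cosine similarity: {cosine:.4f}")
```

Under these assumptions the softmax output is a percentage only relative to the other labels supplied, so the choice of contrast labels matters. The approximate cosine similarity is independent of other labels, but any threshold for "contains an aircraft" would have to be calibrated on your own images.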