I have several images and I want to know whether any of them contains an aircraft.
I used CLIP as shown below, but the output is [[1.0]] even though the image shows human faces. I think this is because it uses softmax.
I tried using logits_per_image instead, but I don't understand the value it returns: tensor([[20.03]]).
Is there a way to get a score, as a percentage or similar, for how strongly an image relates to a word? Alternatively, could I use object detection to check whether there are any aircraft in my image?
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open('image_4.jpg')
inputs = processor(text=['aircraft'], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print(probs.tolist())  # prints [[1.0]] for my image
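To illustrate my suspicion about softmax, here is a minimal sketch in plain Python (no CLIP involved). The `softmax` helper is my own, and the second logit, 23.5, is a made-up value for a hypothetical contrasting prompt, not something the model returned. With a single label the result is 1.0 whatever the logit is; with a second label the values start to depend on the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One candidate label: there is nothing to compare against,
# so the single probability is 1.0 regardless of the logit value.
single = softmax([20.03])

# Two candidate labels: the probabilities now reflect the logits.
# 23.5 is an invented logit for a hypothetical second prompt.
pair = softmax([20.03, 23.5])

print(single)
print(pair)
```

So it seems the softmax output is only meaningful relative to the other text prompts passed in, which would explain why a single prompt cannot give me an "aircraft or not" score.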