I have a simple script where I want to check the similarity between the words "Cat" and "Dog":
from transformers import BertModel, BertTokenizer
import torch
from scipy.spatial.distance import cosine
# Load pre-trained BERT model and tokenizer
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
# Tokenize each word (the tokenizer also adds the special tokens [CLS] and [SEP])
tokens_cat = tokenizer("Cat", return_tensors="pt")
tokens_dog = tokenizer("Dog", return_tensors="pt")
# Get BERT embeddings by mean-pooling the last hidden states
with torch.no_grad():
    embeddings_cat = model(**tokens_cat).last_hidden_state.mean(dim=1).squeeze().numpy()
    embeddings_dog = model(**tokens_dog).last_hidden_state.mean(dim=1).squeeze().numpy()
# Calculate cosine similarity
cosine_similarity = 1 - cosine(embeddings_cat, embeddings_dog)
print(f"Cosine Similarity: {cosine_similarity}")
The code above returns 0.8319976329803467, which is weird because these words are not similar. Could you please tell me what I'm doing wrong?
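As a sanity check, the same pipeline can be run on a word pair that should be unrelated (the word "Philosophy" here is my arbitrary choice, just for illustration):
tokens_phil = tokenizer("Philosophy", return_tensors="pt")
with torch.no_grad():
    emb_cat = model(**tokens_cat).last_hidden_state.mean(dim=1).squeeze().numpy()
    emb_phil = model(**tokens_phil).last_hidden_state.mean(dim=1).squeeze().numpy()
# If this also comes out high, the problem is the pooling, not the word pair
print(f"Cat vs Philosophy: {1 - cosine(emb_cat, emb_phil)}")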
I tried asking ChatGPT and it keeps telling me that this code is right.
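One more thing I noticed that may or may not be relevant: the mean over dim=1 also averages in the special tokens, which is easy to confirm by inspecting the tokenized input:
# Show exactly which tokens get averaged; [CLS] and [SEP] are included
print(tokenizer.convert_ids_to_tokens(tokens_cat["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'Cat', '[SEP]'] (the exact subword split depends on the vocabulary)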