How are image and text embedding vectors brought into a single shared space?


I have seen multiple multimodal architectures that take embeddings of two different modalities, such as image and text, and perform various downstream tasks by measuring the similarity between these vectors: visual question answering, image-to-text or text-to-image retrieval, image captioning, etc. Good examples of such multimodal architectures are CLIP and DALL-E. But I have one question: when the image and text embeddings are initially produced, they belong to different coordinate systems. For instance, say the image embedding vector belongs to some coordinate system "A" and the text embedding belongs to some entirely different coordinate system "B". How are these two vectors projected, or brought, into a single embedding space? What kind of transformation or alignment is done to make them vectors of a single coordinate system, say "C"?

On top of that, their dimensions and sizes will differ. For example, if I pass an image through some CNN I get a 1024-dimensional vector, while the text embedding produced by some transformer is a 768-dimensional vector. How are their dimensions made equal, given that computing a similarity such as cosine similarity between two vectors requires both vectors to be the same size? So, first of all, I want to know how these vectors are brought into the same space and how their dimensions are made equal.
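To make the size mismatch concrete, here is a tiny numpy sketch of the situation I mean (the 1024 and 768 sizes are just example numbers, not from any particular model):

import numpy as np

rng = np.random.default_rng(0)

image_feat = rng.standard_normal(1024)  # e.g. pooled output of a CNN image encoder
text_feat = rng.standard_normal(768)    # e.g. pooled output of a transformer text encoder

# A cosine similarity between these two vectors is not even defined,
# because they have different lengths:
# np.dot(image_feat, text_feat)  # ValueError: shapes (1024,) and (768,) not aligned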

I am implementing a CLIP model to connect X-ray images and radiology reports, and for this I want to know how the image and text embeddings are brought into a single shared space.


1 Answer

Answered by craighagerman

CLIP learns a separate linear projection for each modality: the image features and the text features are multiplied by learned matrices (W_i and W_t in the pseudocode below) that map them to the same embedding dimension, and the projected vectors are then L2-normalized. The dot product of these normalized embeddings is the cosine similarity that the contrastive loss uses to align the image and text latents in the shared space.

Here is the numpy-like pseudocode from the CLIP paper ("Learning Transferable Visual Models From Natural Language Supervision"):

# W_i[d_i, d_e] - learned projection of image features into the joint embedding space
# W_t[d_t, d_e] - learned projection of text features into the joint embedding space
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]: each modality is multiplied by its own
# projection matrix, so both end up with dimension d_e, then L2-normalized
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
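The projections W_i and W_t are exactly what makes the two spaces compatible: they map the d_i-dimensional image features and the d_t-dimensional text features into the same d_e-dimensional space, where cosine similarity is well defined. Below is a minimal runnable PyTorch sketch of the same recipe; the encoder outputs are faked with random tensors, and the sizes (1024, 768, 512) and the temperature initialization are placeholder choices, so swap in your own X-ray image encoder and report text encoder:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_i, d_t, d_e, n = 1024, 768, 512, 8   # image dim, text dim, shared dim, batch size (example values)

# learned linear projections playing the role of W_i and W_t above
image_proj = nn.Linear(d_i, d_e, bias=False)
text_proj = nn.Linear(d_t, d_e, bias=False)
logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature, initialized to log(1/0.07)

# stand-ins for the outputs of your image and text encoders
I_f = torch.randn(n, d_i)
T_f = torch.randn(n, d_t)

# project into the shared space and L2-normalize -> both are [n, d_e]
I_e = F.normalize(image_proj(I_f), dim=-1)
T_e = F.normalize(text_proj(T_f), dim=-1)

# scaled pairwise cosine similarities [n, n]
logits = logit_scale.exp() * I_e @ T_e.t()

# symmetric contrastive loss: matching image/report pairs lie on the diagonal
labels = torch.arange(n)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())

During training, the projection weights (and usually the encoders themselves) are updated by this loss, which is what pulls matching image and text embeddings together in the shared space.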