I've been trying to build intuition for how transformers work behind the scenes for language translation. I implemented one in a spreadsheet in order to visualize the math and how the embeddings are transformed. But there's one component that still isn't clear to me: where the "next word mapping" actually occurs.
For example, let's consider a small inference task: translating the Spanish "yo estoy bien" into English as "I am fine". Given an input sequence that starts with [<BOS>, 'I', 'am'], which component of the transformer model is tasked with transforming the embedding of 'am' into the embedding for the subsequent word 'fine'? I know that all of them are somewhat involved, but which one is the key one?
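To make the setup concrete, here is a tiny NumPy sketch of how I currently picture the read-out step. Everything in it (dimensions, weights, the hidden state) is made up and untrained, so it won't actually predict 'fine'; it's only meant to show that the next-word prediction is taken from the decoder's output at the last target position ('am'):

```python
# Toy sketch: whatever the decoder layers do internally, the prediction for the next
# word comes from the hidden state at the LAST target position, projected onto the
# vocabulary. All names and values below are made up for illustration.
import numpy as np

vocab = ["<BOS>", "I", "am", "fine", "yo", "estoy", "bien"]
d_model = 4

rng = np.random.default_rng(0)
W_out = rng.normal(size=(d_model, len(vocab)))   # output projection to vocab logits

# Pretend this is the decoder's final hidden state at the position of 'am',
# after masked self-attention, cross-attention and feed-forward layers.
h_am = rng.normal(size=(d_model,))

logits = h_am @ W_out                            # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over the vocabulary

next_token = vocab[int(np.argmax(probs))]        # would ideally be 'fine' in a trained model
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```

So my question is really about which sub-layer is responsible for turning that last hidden state into something that points at 'fine'.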
Here's my current understanding of possible roles:
Masked Causal Self-Attention: Captures relationships among words in the target language (e.g., between 'I' and 'am'), but does it help choose 'fine' as the next word?
Cross-Attention: Finds correlations between source and target language words (e.g., 'I' with 'yo' and 'am' with 'estoy'), but is it responsible for producing the next word in the sequence?
Feed-Forward Layers: Apply a transformation to each embedding independently, without considering the context from other words in the sentence. How could they be responsible for selecting 'fine' following 'am' without context from the other embeddings? (I've put a rough sketch of how I picture the data flow right after this list.)
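To show what I mean by each of these roles, here is a rough NumPy sketch of the data flow through one decoder layer as I picture it. The weights are random, and I've left out the learned query/key/value projections, multiple heads, residual connections and layer norm, so it's only a caricature of the three sub-blocks:

```python
# Toy data flow through one decoder layer's three sub-blocks (no training, no
# projections, no residuals/norm, single head) -- just the mixing pattern.
import numpy as np

rng = np.random.default_rng(1)
d = 4
tgt = rng.normal(size=(3, d))   # embeddings for [<BOS>, 'I', 'am']
src = rng.normal(size=(3, d))   # encoder outputs for ['yo', 'estoy', 'bien']

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block attention to future positions
    return softmax(scores) @ v

# 1) Masked causal self-attention: each target position mixes only with itself and
#    earlier target positions ('am' can see <BOS>, 'I', 'am' but nothing after).
causal_mask = np.tril(np.ones((3, 3), dtype=bool))
x = attention(tgt, tgt, tgt, mask=causal_mask)

# 2) Cross-attention: queries come from the target side, keys/values from the encoder,
#    so each target position pulls in information from the whole source sentence.
x = attention(x, src, src)

# 3) Feed-forward: the same small MLP applied to each position independently
#    (no mixing across positions happens here).
W1, W2 = rng.normal(size=(d, 8)), rng.normal(size=(8, d))
x = np.maximum(x @ W1, 0) @ W2

print(x.shape)  # (3, d): one transformed vector per target position; the last row is 'am'
```

In this picture, each sub-block just produces a new vector per target position, and I can't tell which of the three is the one that "moves" the last position's vector toward 'fine'.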
In short, what I see in the attention mechanisms is that they adjust an embedding to align more closely with the embeddings it is most similar to. But what I would expect in cross-attention is to see the word 'am' having a high similarity with the most likely next word, 'bien', so that the new embedding for 'am' would be mapped toward 'fine', the English equivalent of the embedding for 'bien'.
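Here is a contrived little example of that expectation, with hand-picked 2-d vectors chosen so that the query for 'am' is most similar to the key for 'bien' (a real model would of course learn projections to produce these queries, keys and values):

```python
# Contrived cross-attention step: the query for 'am' is closest to the key for 'bien',
# so the attention output for 'am' is dominated by the value vector of 'bien'.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

src_words = ["yo", "estoy", "bien"]
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # source keys
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])  # source values
q_am   = np.array([1.0, 1.0])                               # query from 'am', closest to 'bien'

weights = softmax(q_am @ keys.T / np.sqrt(2))
out_am = weights @ values   # new vector for 'am', pulled toward the 'bien' value

print(dict(zip(src_words, weights.round(3))))  # 'bien' gets the largest weight
print(out_am.round(2))                         # ends up close to the value for 'bien'
```

In this toy picture the output for 'am' ends up close to the value vector of 'bien', which is what I would expect if cross-attention were the place where the next word gets selected.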
I hope I was able to make my question clear; it's hard for me to explain. I'd appreciate any pointers in the right direction. Thanks