To calculate self-attention, we create a Query vector, a Key vector, and a Value vector for each word. These vectors are created by multiplying the word's embedding by three matrices that were learned during training, denoted WQ, WK, and WV.
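
For concreteness, here is a minimal sketch of that computation (the dimensions and random values are illustrative placeholders, not trained weights):

```python
import numpy as np

d_model, d_k = 512, 64            # illustrative sizes
rng = np.random.default_rng(0)

# One set of projection matrices (learned during training in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Embeddings for a 3-word input, one row per word
X = rng.normal(size=(3, d_model))

# The same W_Q / W_K / W_V multiply every row of X
Q = X @ W_Q   # shape (3, d_k): one Query vector per word
K = X @ W_K   # one Key vector per word
V = X @ W_V   # one Value vector per word
```

Note that the matrix product applies the same weights to every row of X, so each word's Query, Key, and Value vectors come from identical matrices; only the embeddings differ per word.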

Question: are these matrices WQ, WK, WV the same for every input word (embedding), or are they different for different words?

Paper link
