I have only recently started learning BERT.
Some tutorials show that after embedding a sentence, a matrix X of shape [seq_len, 768] is formed, and X is fed into multi-head attention, i.e., several self-attention heads.
But in FasterTransformer, why is the input [seq_len, head_num, size_per_head]? It looks as if the matrix X is split evenly across the heads, so each head receives only its own slice rather than the complete matrix X.
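To show what I mean, here is a rough PyTorch sketch with made-up sizes (not FasterTransformer's actual code), illustrating that the two shapes seem to describe the same data, just viewed per head:

```python
import torch

seq_len, hidden_size = 8, 768          # hypothetical sizes for illustration
head_num, size_per_head = 12, 64       # 12 * 64 == 768 as in BERT-base

X = torch.randn(seq_len, hidden_size)  # the "complete" matrix from the tutorials

# The per-head view appears to be just a reshape of the same tensor:
X_heads = X.view(seq_len, head_num, size_per_head)

# Head 0's slice is simply the first 64 columns of X
assert torch.equal(X_heads[:, 0, :], X[:, 0:64])
# Reshaping back recovers the original matrix exactly
assert torch.equal(X_heads.reshape(seq_len, hidden_size), X)
```

Is this reshape all that is happening, or does each head really get a different input?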
So what is the actual input to each attention head?