Transformer's zero-padding becomes non-zero after passing through normalization layers, causing unnecessary weight updates


Since NLP tasks have variable-length data, we need to add padding so that all inputs in a mini-batch have the same length. However, the padding positions become non-zero after passing through normalization layers (e.g., LayerNorm applies a learnable bias, so even an all-zero vector maps to a non-zero output). This produces gradients at the padding positions, which leads to unnecessary weight updates.
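Here is a minimal sketch of the effect, assuming PyTorch's `nn.LayerNorm` and a bias that has drifted from its zero initialization during training (the value 0.1 is just for illustration):

```python
import torch
import torch.nn as nn

d_model = 8
layer_norm = nn.LayerNorm(d_model)

# Simulate a trained bias; beta starts at zero, but gradient updates move it.
with torch.no_grad():
    layer_norm.bias.fill_(0.1)

padding_vector = torch.zeros(1, d_model)  # an all-zero padding position
print(layer_norm(padding_vector))         # ~0.1 everywhere, no longer zero
```

Because the zero vector has zero mean and variance, the normalized value is zero and the output reduces to the bias term, so any non-zero bias makes every padding position non-zero.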

Has anyone seen a paper that tries to solve this problem?

Reference: https://tunz.kr/post/4

I found that many other implementations handle this issue (e.g., by resetting the padding positions to zero at every sublayer), as in the sketch below.
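A minimal sketch of that reset, assuming a `(batch, seq_len)` padding mask with 1 for real tokens and 0 for padding; the surrounding layer names (`self_attention`, `feed_forward`, etc.) are hypothetical placeholders:

```python
import torch

def reset_padding(x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
    """Zero out padded positions after a sublayer.

    x:        (batch, seq_len, d_model) sublayer output
    pad_mask: (batch, seq_len), 1.0 for real tokens, 0.0 for padding
    """
    return x * pad_mask.unsqueeze(-1)

# Usage inside an encoder layer (hypothetical structure):
# x = reset_padding(layer_norm(x + self_attention(x, attn_mask)), pad_mask)
# x = reset_padding(layer_norm(x + feed_forward(x)), pad_mask)
```

Multiplying by the mask keeps padded positions exactly zero after each sublayer, so they contribute no gradient to subsequent computations.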
