Transformer's zero-padding becomes non-zero after passing through normalization layers, causing unnecessary weight updates


Since NLP tasks have variable-length data, we need to add padding so that all inputs in a mini-batch have the same length. However, the padding positions become non-zero after passing through normalization layers (e.g., LayerNorm applies a learnable bias, so even an all-zero vector maps to a non-zero output). This produces gradients at the padding positions, which leads to unnecessary weight updates.
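Here is a minimal sketch of the effect, assuming PyTorch's `nn.LayerNorm` and a bias that has drifted from its zero initialization during training (the value 0.1 is just for illustration):

```python
import torch
import torch.nn as nn

d_model = 8
layer_norm = nn.LayerNorm(d_model)

# Simulate a trained bias; beta starts at zero, but gradient updates move it.
with torch.no_grad():
    layer_norm.bias.fill_(0.1)

padding_vector = torch.zeros(1, d_model)  # an all-zero padding position
print(layer_norm(padding_vector))         # ~0.1 everywhere, no longer zero
```

Because the zero vector has zero mean and variance, the normalized value is zero and the output reduces to the bias term, so any non-zero bias makes every padding position non-zero.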

Has anyone seen a paper that tries to solve this problem?

Reference: https://tunz.kr/post/4

I found that many other implementations handle this issue (e.g., by resetting the padding positions to zero at every sublayer), as in the sketch below.
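A minimal sketch of that reset, assuming a `(batch, seq_len)` padding mask with 1 for real tokens and 0 for padding; the surrounding layer names (`self_attention`, `feed_forward`, etc.) are hypothetical placeholders:

```python
import torch

def reset_padding(x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
    """Zero out padded positions after a sublayer.

    x:        (batch, seq_len, d_model) sublayer output
    pad_mask: (batch, seq_len), 1.0 for real tokens, 0.0 for padding
    """
    return x * pad_mask.unsqueeze(-1)

# Usage inside an encoder layer (hypothetical structure):
# x = reset_padding(layer_norm(x + self_attention(x, attn_mask)), pad_mask)
# x = reset_padding(layer_norm(x + feed_forward(x)), pad_mask)
```

Multiplying by the mask keeps padded positions exactly zero after each sublayer, so they contribute no gradient to subsequent computations.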
