Transformer based decoding

1.1k Views · Asked by shiredude95

Can the decoder in a Transformer model be parallelized like the encoder? As far as I understand, the encoder has all the tokens in the sequence available to compute the self-attention scores. But for a decoder this does not seem possible (in both training and testing), since self-attention is computed over the outputs of previous timesteps. Even with a technique like teacher forcing, where we concatenate the expected output with the obtained one, there is still a sequential dependence on the previous timestep. In that case, apart from the improvement in capturing long-term dependencies, is a Transformer decoder any better than, say, an LSTM when comparing purely on the basis of parallelization?
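For concreteness, here is a minimal sketch of the sequential dependence described above, assuming an LSTM decoder built from PyTorch's `nn.LSTMCell` (the vocabulary size, model width, and target length are illustrative): even when teacher forcing supplies the gold tokens, the hidden state from step t-1 is an input to step t, so the timesteps cannot be computed in parallel.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64          # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)
cell = nn.LSTMCell(d_model, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tgt = torch.randint(0, vocab_size, (1, 7))   # gold target tokens (teacher forcing)
h = torch.zeros(1, d_model)                  # initial hidden state
c = torch.zeros(1, d_model)                  # initial cell state

logits = []
for t in range(tgt.size(1)):
    # Teacher forcing feeds the gold token at step t, but h and c still come
    # from step t-1, so step t cannot start before step t-1 has finished.
    h, c = cell(embed(tgt[:, t]), (h, c))
    logits.append(lm_head(h))
logits = torch.stack(logits, dim=1)          # (1, 7, vocab_size)
```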
There is 1 best solution below.
You are correct about inference time: both an LSTM decoder and a Transformer decoder generate one token at a time, so neither is parallelized over the output tokens during generation. During training, however, the two differ. With teacher forcing, the Transformer decoder receives the entire (shifted) target sequence at once and applies a causal (look-ahead) mask in its self-attention, so all output positions are computed in a single parallel forward pass. An LSTM decoder, even with teacher forcing, must still propagate its hidden state from one timestep to the next and therefore remains sequential over time. That training-time parallelism, in addition to better handling of long-range dependencies, is the main parallelization advantage of a Transformer decoder over an LSTM. For a detailed summary of the Transformer architecture and the training/testing process, see this article.
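A minimal sketch of this distinction, assuming PyTorch's built-in `nn.TransformerDecoder` (the model sizes, the dummy encoder output `memory`, and the BOS token id are illustrative, and positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model, nhead, num_layers = 1000, 64, 4, 2

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers)
lm_head = nn.Linear(d_model, vocab_size)

def causal_mask(sz):
    # -inf above the diagonal: position i may not attend to positions j > i.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

memory = torch.randn(1, 10, d_model)         # stand-in for the encoder output
tgt = torch.randint(0, vocab_size, (1, 7))   # gold target tokens (teacher forcing)

# Training: all 7 output positions in ONE forward pass.
# The causal mask makes this equivalent to feeding the prefix token by token.
logits = lm_head(decoder(embed(tgt), memory, tgt_mask=causal_mask(tgt.size(1))))
print(logits.shape)  # torch.Size([1, 7, 1000])

# Inference: autoregressive, one token per step (sequential, like an LSTM).
generated = torch.zeros(1, 1, dtype=torch.long)  # assume id 0 is BOS
for _ in range(7):
    out = lm_head(decoder(embed(generated), memory,
                          tgt_mask=causal_mask(generated.size(1))))
    next_id = out[:, -1].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_id], dim=1)
```

Note that the generation loop above re-runs attention over the whole prefix at every step (in practice key/value caching avoids the recomputation), but the loop itself remains inherently sequential; the parallelism gain over an LSTM shows up in training, not in token-by-token generation.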