What memory does a decoder-only Transformer use?

2.8k Views Asked by bellerb At 28 October 2025 at 13:29

I've been reading a lot about transformers and self-attention, and have seen that BERT and GPT-2 are newer variants that use only an encoder transformer (BERT) or only a decoder transformer (GPT-2). I've been trying to build a decoder-only model myself for next-sequence prediction, but I'm confused by one thing. I'm using PyTorch and have looked at the Seq2Seq tutorial, and then looked into the Transformer Decoder Block, which is made up of Transformer Decoder Layers. My confusion comes from the memory that these also need to be passed. The documentation says memory is the output of the last layer of the encoder block, which makes sense for a Seq2Seq model, but I want to make a decoder-only model. So my question is: what do you pass as memory to a decoder-only model like GPT-2 if you do not have an encoder?


There is 1 best solution below

bellerb On 15 January 2021 at 12:10

After further investigation I believe I can now answer this myself. A decoder-only transformer doesn't use any memory at all, because it has no encoder-decoder (cross) attention like an encoder-decoder transformer does. Structurally, a decoder-only transformer looks a lot like an encoder transformer, except that it uses masked self-attention in place of unmasked self-attention. To get this behaviour, pass a square subsequent mask (upper triangle masked out) so that the model cannot look ahead at future positions; that gives you a decoder-only model like the one found in GPT-2/GPT-3. In PyTorch terms, this means building the model from the encoder layers plus that mask, rather than from the decoder layers, which always expect a memory argument.
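As a rough illustration of the idea above, here is a minimal sketch of a GPT-style model built from `nn.TransformerEncoder` with a causal mask, so no `memory` is ever passed. The class name `TinyDecoderOnlyLM`, the learned positional embedding, and all sizes are assumptions made up for this example, not anything taken from GPT-2 itself. The mask is built by hand with `torch.triu`, which should produce the same kind of upper-triangular `-inf` mask as PyTorch's square subsequent mask helper.

```python
import torch
import torch.nn as nn


class TinyDecoderOnlyLM(nn.Module):
    """Hypothetical GPT-style model: an encoder stack run with a causal mask."""

    def __init__(self, vocab_size=100, d_model=32, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned positional embedding (an assumption; other schemes work too).
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, dropout=0.0, batch_first=True
        )
        # Encoder layers contain self-attention only, so there is no memory input.
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) tensor of token ids
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        # Square subsequent mask: -inf above the diagonal blocks attention
        # to future positions, 0 elsewhere leaves attention unchanged.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal_mask)
        # Next-token logits for every position: (batch, seq_len, vocab_size)
        return self.lm_head(x)


model = TinyDecoderOnlyLM().eval()
tokens = torch.randint(0, 100, (2, 10))
with torch.no_grad():
    logits = model(tokens)
```

Because of the mask, the logits at position `i` depend only on tokens `0..i`, so changing a later token should leave the earlier outputs unchanged. For training, the usual approach is to shift the targets by one position and apply cross-entropy to these logits.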
