I know that GPT uses the Transformer decoder, BERT uses the Transformer encoder, and T5 uses the full Transformer encoder-decoder. But can someone help me understand why GPT only uses the decoder, BERT only uses the encoder, and T5 uses both?

What can you do with just an encoder (no decoder), just a decoder (no encoder), and with both an encoder and a decoder?

I'm new to NLP so any help would be nice :D Thanks!

Let me try to answer your question based on the book Natural Language Processing with Transformers:

The original Transformer architecture was designed for sequence-to-sequence tasks like machine translation and summarization, and it contains both an encoder and a decoder. Soon afterwards, standalone models that use only the encoder or only the decoder were developed.

Encoder-only models: These models are mostly used for NLU (Natural Language Understanding) tasks like text classification and NER. They compute numerical representations of the input text, and each token's representation is based on both the left and right context, often called bidirectional attention. Models like BERT fall into this category.
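As a concrete illustration (mine, not the book's), here is a minimal sketch using the Hugging Face transformers library: BERT fills in a masked token, which requires looking at the words on both sides of the mask.

```python
from transformers import pipeline

# Encoder-only (BERT): predict a masked token using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The capital of France is [MASK].")

for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```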

Decoder-only models: These models are used for NLG (Natural Language Generation) tasks. They iteratively predict the most probable next word given an input sequence, so each token's representation is based only on the left context, often called causal or autoregressive attention. The models in the GPT family belong to this category.
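A minimal sketch (again my own example) of autoregressive generation with a decoder-only model, where each new token is predicted from the tokens to its left:

```python
from transformers import pipeline

# Decoder-only (GPT-2): generate a continuation one token at a time, left to right.
generator = pipeline("text-generation", model="gpt2")
out = generator("Transformers are a type of neural network that",
                max_new_tokens=20, do_sample=False)
print(out[0]["generated_text"])
```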

Encoder-decoder models: As the name suggests, these models are used for more complex tasks that involve both understanding and generating natural language, such as machine translation and summarization. Models like BART and T5 belong to this category.
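Here is a rough sketch (not from the book) of an encoder-decoder model in action: the encoder reads the whole input and the decoder generates a new sequence from it. The specific checkpoint t5-small and the pipeline tasks are my choices for illustration.

```python
from transformers import pipeline

# Encoder-decoder (T5): the encoder reads the input, the decoder writes the output.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.")[0]["translation_text"])

summarizer = pipeline("summarization", model="t5-small")
text = ("The transformer architecture was introduced for machine translation. "
        "It replaced recurrence with self-attention and now underlies models "
        "such as BERT, GPT and T5.")
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
```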

The authors also mention that the distinction between decoder-only and encoder-only architectures is a bit blurry. For example, machine translation, which is a sequence-to-sequence task, can be solved with GPT-style models, and encoder-only models like BERT can also be applied to summarization tasks.
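To make that blurriness concrete, here is a rough sketch (my own illustration, not the book's) of posing translation as plain left-to-right generation with a decoder-only model. A small base model like gpt2 will translate poorly; the point is only how the task is framed.

```python
from transformers import pipeline

# A seq2seq task (translation) posed as pure next-token generation:
# the source sentence simply becomes part of the prompt.
generator = pipeline("text-generation", model="gpt2")
prompt = "English: The house is wonderful.\nGerman:"
out = generator(prompt, max_new_tokens=15, do_sample=False)
print(out[0]["generated_text"])
```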