How to freeze parts of T5 transformer model


I know that T5 has K, Q and V projections in each layer, as well as a feed-forward network. I would like to freeze the K, Q and V projections and only train the feed-forward layers in each layer of T5. I am using the PyTorch library. The model could be a wrapper around the Hugging Face T5 model or a modified version of it. I know how to freeze all parameters using the following code:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(underlying_model_name)
model = T5ForConditionalGeneration.from_pretrained(underlying_model_name)

for p in model.parameters():
    p.requires_grad = False # freezing

Could you please guide me on how I can do this?

This GitHub project could probably be helpful, but it's for RoBERTa and GPT. Could I adapt it for T5?

There are 2 answers below.

BEST ANSWER

I've adapted a solution based on this discussion from the Hugging Face forums. Basically, you have to specify the names of the modules/PyTorch layers that you want to freeze.

In your particular case of T5, I started by looking at the model summary:

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
print(model)

This gives the following (abbreviated output):

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseReluDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
[...]  # abbreviated output

With this, we can then generate a list of the modules that we want to freeze. In particular, I decided to freeze the entire T5LayerSelfAttention block of the encoder (and, additionally, the T5LayerCrossAttention of the decoder):

# All encoder self-attention modules (layer[0] of each block)
modules_to_freeze = [model.encoder.block[i].layer[0] for i in range(len(model.encoder.block))]
# The decoder blocks have both a self-attention (layer[0]) ...
modules_to_freeze.extend([model.decoder.block[i].layer[0] for i in range(len(model.decoder.block))])
# ... and a cross-attention (layer[1]) module
modules_to_freeze.extend([model.decoder.block[i].layer[1] for i in range(len(model.decoder.block))])

And then simply freeze all the parameters in the respective modules:

for module in modules_to_freeze:
    for param in module.parameters():
        param.requires_grad = False  # Actual freezing operation

You can verify that these are actually frozen in your model by running the following:

for param in model.parameters():
    print(param.requires_grad)

which should print quite a few False values as well. If you really only want to freeze K, Q and V, you can adapt the above process to sub-select just the projections you want, as in the sketch below.
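
For instance, a minimal sketch that freezes only the q, k and v projections and leaves everything else trainable, assuming the standard Hugging Face parameter names shown in the summary above (EncDecAttention, not shown in the abbreviated output, is the decoder's cross-attention module):

for name, param in model.named_parameters():
    # matches e.g. encoder.block.0.layer.0.SelfAttention.q.weight
    # and decoder.block.0.layer.1.EncDecAttention.k.weight
    if any(f".{proj}.weight" in name for proj in ("q", "k", "v")):
        param.requires_grad = False  # freeze only the K/Q/V projections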

SECOND ANSWER

Option 1

# freeze everything
for param in model.parameters():
    param.requires_grad = False

# number of encoder blocks (e.g. 12 for t5-base)
num_encoder_layers = len(model.encoder.block)

# and un-freeze the lower 4 layers of the encoder (this assumes a 12-layer encoder)
for i in range(0, num_encoder_layers - 8, 1):
    for param in model.encoder.block[i].parameters():
        param.requires_grad = True

# verify
for name, param in model.named_parameters():
    print(name, param.requires_grad)
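
If the per-parameter printout above is too verbose, a quick sanity check (just an illustrative sketch) is to count how many parameters remain trainable:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")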

Option 2

num_encoder_layers = len(model.encoder.block)
num_decoder_layers = len(model.decoder.block)

# Freeze the upper 3 layers of the encoder (the lower layers stay unfrozen)
for i in range(num_encoder_layers - 1, num_encoder_layers - 4, -1):
    for param in model.encoder.block[i].parameters():
        param.requires_grad = False

# Freeze all layers of the decoder
for i in range(num_decoder_layers):
    for param in model.decoder.block[i].parameters():
        param.requires_grad = False

# verify
for name, param in model.named_parameters():
    print(name, param.requires_grad)
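
Whichever option you choose, when setting up training you can pass only the still-trainable parameters to the optimizer. A minimal sketch, where AdamW and the learning rate are just placeholder choices:

from torch.optim import AdamW

optimizer = AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only unfrozen params
    lr=1e-4,  # placeholder learning rate
)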