padding_idx or self.num_embeddings gets changed into a string while finetuning Llama 2


I am trying to fine-tune a Llama 2 7B model using QLoRA with multiple GPUs on Databricks, following along with this example. I am using my own dataset, and I think my problems begin with adding special tokens.

from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

tokenizer.add_special_tokens({'eos_token': '</s>', 'bos_token': '<s>', 'pad_token': '<pad>', 'sep_token': '<|body|>'})
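
For reference, a quick sanity check right after adding the tokens is to confirm that the IDs the tokenizer reports are plain integers (this is my own check, not from the example):

# confirm the new special tokens got integer IDs
print(len(tokenizer))                                  # vocabulary size after add_special_tokens
print(tokenizer.bos_token_id, tokenizer.eos_token_id)  # should both be ints
print(tokenizer.pad_token_id, tokenizer.sep_token_id)  # should both be ints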

This is what my config code looks like. Unfortunately, we are already diverging from the example quite a bit here, but I believe I do need to specify these special tokens. From previous experience, the model performs a lot better when I have these tokens explicitly defined.

config = LlamaConfig(model_name, 
                      bos_token_id=tokenizer.bos_token_id,
                      eos_token_id=tokenizer.eos_token_id,
                      pad_token_id=tokenizer.pad_token_id,
                      sep_token_id=tokenizer.sep_token_id,
                      output_hidden_states=False)
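
Before touching the model, the only extra debugging I can think of is to print the config fields and their types, to make sure everything I passed in came through as integers (again my own sketch, not from the example):

# inspect what actually ended up in the config
print(type(config.vocab_size), config.vocab_size)
print(type(config.pad_token_id), config.pad_token_id)
print(type(config.hidden_size), config.hidden_size)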

All the code up to this point "works". But then I try to instantiate the model, and I get a very strange error. I have tried two different ways to create the model.

1.

model = LlamaForCausalLM.from_pretrained(
  model_name,
  config=config
  )

2.

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  quantization_config=bnb_config,
  config=config,
  trust_remote_code=True
  )
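
(bnb_config is defined earlier in my notebook; it is just a standard 4-bit BitsAndBytesConfig, roughly like the sketch below, though the exact values may differ.)

from transformers import BitsAndBytesConfig
import torch

# rough sketch of the quantization config used above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)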

Both of these ways return the same error: TypeError: '<' not supported between instances of 'int' and 'str'

Here are the last three blocks of the traceback:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-569a19c6-ee18-4d2c-b8fa-74dc24547bca/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:732, in LlamaForCausalLM.__init__(self, config)
    730 def __init__(self, config):
    731     super().__init__(config)
--> 732     self.model = LlamaModel(config)
    733     self.pretraining_tp = config.pretraining_tp
    734     self.vocab_size = config.vocab_size

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-569a19c6-ee18-4d2c-b8fa-74dc24547bca/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:560, in LlamaModel.__init__(self, config)
    557 self.padding_idx = config.pad_token_id
    558 self.vocab_size = config.vocab_size
--> 560 self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
    561 self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
    562 self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

File /databricks/python/lib/python3.10/site-packages/torch/nn/modules/sparse.py:133, in Embedding.__init__(self, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, _weight, _freeze, device, dtype)
    131 if padding_idx is not None:
    132     if padding_idx > 0:
--> 133         assert padding_idx < self.num_embeddings, 'Padding_idx must be within num_embeddings'
    134     elif padding_idx < 0:
    135         assert padding_idx >= -self.num_embeddings, 'Padding_idx must be within num_embeddings'
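
Based on that last frame, a minimal way to reproduce the same message with plain torch is to hand nn.Embedding a string where num_embeddings should be an int (hypothetical values, just to illustrate the failure mode):

import torch.nn as nn

# hypothetical: if num_embeddings arrives as a string, the assert on line 133
# compares an int against a str and raises the same TypeError
nn.Embedding("32000", 4096, padding_idx=2)
# TypeError: '<' not supported between instances of 'int' and 'str'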

Somehow either padding_idx or self.num_embeddings has been turned into a string, and I'm really not sure how or why. I get the feeling that it's just not currently possible to use special tokens with Llama 2, but if anyone has figured out how to do it, let me know. I still need to try fine-tuning without the special tokens defined and see how the final performance compares, but who knows, maybe I'll get the same error.
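
Roughly, that fallback would just skip the custom config and the added tokens (same model_name and bnb_config as above):

# fallback attempt: skip the custom LlamaConfig entirely
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)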
