I was fine-tuning a Llama-architecture model that supports multiple languages: English, Hindi, and Roman Hindi. I loaded the model in 4-bit quantized form using bitsandbytes, with the NF4 quant type and double quantization.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
)
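As a quick sanity check on the 4-bit load, something along these lines reports the model's in-memory footprint (just a diagnostic sketch, not part of the training script):

```python
# Report the quantized model's approximate in-memory size
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```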
I was fine-tuning this model for a chat format, so I added a token to the tokenizer, registered it in the model config, and resized the model embeddings to the final length of the tokenizer.
**# Adding a new padding token to the tokenizer even though one is already present in the tokenizer ([PAD]). It would be very helpful if someone could point out the reason for this.**
if '<pad>' not in tokenizer.get_vocab():
    print("Token Added")
    # Add the pad token
    tokenizer.add_tokens(['<pad>'])

# Set the pad token
tokenizer.pad_token = '<pad>'
voca = tokenizer.get_vocab()

# Resize the token embeddings to the new tokenizer length
model.resize_token_embeddings(len(tokenizer))

# Update the pad token id in the model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID"
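For reference, a quick check like this shows what pad token (if any) the tokenizer already defines and whether the embedding matrix matches the tokenizer length (diagnostic sketch only):

```python
# Inspect the existing special tokens and the currently configured pad token
print(tokenizer.special_tokens_map)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# Tokenizer length vs. number of embedding rows after the resize
print(len(tokenizer), model.get_input_embeddings().weight.shape[0])
```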
Then I set up the LoRA adapters for fine-tuning:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "down_proj", "gate_proj", "up_proj", "k_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 35782656 || all params: 3667800064 || trainable%: 0.9755890554453079
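print_trainable_parameters is just the usual small helper; roughly, it does something like this (a sketch of an assumed implementation, mine may differ slightly):

```python
def print_trainable_parameters(model):
    # Count trainable vs. total parameters of the PEFT-wrapped model
    trainable, total = 0, 0
    for _, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")
```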
Now the vocab size of the model is 48065. I then fine-tuned the model on about 80 examples in Hindi, Roman Hindi, and some English prompts.

Training arguments (inside the Trainer call):
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,  # assumed variable name for the ~80-example dataset
    args=transformers.TrainingArguments(
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        max_grad_norm=1,
        warmup_ratio=0.1,
        learning_rate=1e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
    ),
    data_collator=data_collator,
)
model.config.use_cache = False  # disable the KV cache during training
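Training and saving the adapter then looks roughly like this (sketch; the local path is just a placeholder):

```python
trainer.train()

# Saving the PEFT-wrapped model writes only the LoRA adapter weights, not the full base model
model.save_pretrained("outputs/adapter")  # placeholder path
```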
Then, after training, I push this model to the Hugging Face Hub using model.push_to_hub(repo_id = rep_id). # This is the adapter and not the merged model, right?
- This pushed model is 1.72 GB, which does not match any adapter size I have seen online. Is that OK, or what might be the issue?
- When I try to merge this adapter with the base model, I get CUDA out of memory on a Turing T4 GPU, which is understandable (the merge attempt is sketched after this list).
- So I save the adapter, load the base model again, resize its embeddings, and then attach the adapter to it for inference. But I still cannot run inference, because the model with the attached adapter is now huge due to the large adapter file.
- How do I resolve this issue of the large adapter file and combine the base model with the adapter?
- Also, when I try to load the saved adapter from local files using:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("path_to_adapter_file", local_files_only=True, device_map=device)

the full 16 GB of CUDA memory gets used. I don't know the reason, even though the adapter file is only 1.72 GB.
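For clarity, the merge and reload steps I attempt look roughly like this (a simplified sketch; the dtype and variable names are placeholders rather than my exact script):

```python
from peft import PeftModel

# Reload the base model (fp16 here just for the sketch) and match the resized vocab
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
base.resize_token_embeddings(len(tokenizer))  # vocab size 48065 after adding <pad>

# Attach the saved LoRA adapter
peft_model = PeftModel.from_pretrained(base, "path_to_adapter_file")

# Folding the adapter into the base weights is where the T4 runs out of CUDA memory
merged = peft_model.merge_and_unload()
```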
Am I making a mistake somewhere? Is there any solution to this? Thanks.