I was fine-tuning a Llama-architecture model that supports multiple languages: English, Hindi, and Roman Hindi. I loaded the model in 4-bit quantized form using bitsandbytes, with the NF4 quant type and double quantization.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
)
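As a quick sanity check on the 4-bit load, something along these lines reports the model's in-memory footprint (just a diagnostic sketch, not part of the training script):

```python
# Report the quantized model's approximate in-memory size
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```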
I was fine-tuning this model for a chat format, so I added a token to the tokenizer, registered it in the model config, and resized the model embeddings to the final length of the tokenizer.
**# Adding a new padding token to the tokenizer even though one is already present in the tokenizer ([PAD]). It would be very helpful if someone could point out the reason for this.**
if '<pad>' not in tokenizer.get_vocab():
    print("Token Added")
    # Add the pad token
    tokenizer.add_tokens(['<pad>'])

# Set the pad token
tokenizer.pad_token = '<pad>'
voca = tokenizer.get_vocab()

# Resize the token embeddings to the new tokenizer length
model.resize_token_embeddings(len(tokenizer))

# Update the pad token id in the model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID"
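For reference, a quick check like this shows what pad token (if any) the tokenizer already defines and whether the embedding matrix matches the tokenizer length (diagnostic sketch only):

```python
# Inspect the existing special tokens and the currently configured pad token
print(tokenizer.special_tokens_map)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# Tokenizer length vs. number of embedding rows after the resize
print(len(tokenizer), model.get_input_embeddings().weight.shape[0])
```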
Then I set up the LoRA adapters for fine-tuning:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "down_proj", "gate_proj", "up_proj", "k_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 35782656 || all params: 3667800064 || trainable%: 0.9755890554453079
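print_trainable_parameters is just the usual small helper; roughly, it does something like this (a sketch of an assumed implementation, mine may differ slightly):

```python
def print_trainable_parameters(model):
    # Count trainable vs. total parameters of the PEFT-wrapped model
    trainable, total = 0, 0
    for _, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")
```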
Now the vocab size of the model is 48065. I then fine-tuned the model on about 80 examples in Hindi, Roman Hindi, and some English prompts.

Training arguments (inside the Trainer call):
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,  # assumed variable name for the ~80-example dataset
    args=transformers.TrainingArguments(
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        max_grad_norm=1,
        warmup_ratio=0.1,
        learning_rate=1e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
    ),
    data_collator=data_collator,
)
model.config.use_cache = False  # disable the KV cache during training
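Training and saving the adapter then looks roughly like this (sketch; the local path is just a placeholder):

```python
trainer.train()

# Saving the PEFT-wrapped model writes only the LoRA adapter weights, not the full base model
model.save_pretrained("outputs/adapter")  # placeholder path
```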
Then, after training, I push this model to the Hugging Face Hub using model.push_to_hub(repo_id = rep_id). # This is the adapter and not the merged model, right?
- This pushed model is 1.72 GB, which does not match any adapter size I have seen online. Is that OK, or what might be the issue?
- When I try to merge this adapter with the base model, I get CUDA out of memory on a Turing T4 GPU, which is understandable (the merge attempt is sketched after this list).
- So I save the adapter, load the base model again, resize its embeddings, and then attach the adapter to it for inference. But I still cannot run inference, because the model with the attached adapter is now huge due to the large adapter file.
- How do I resolve this issue of the large adapter file and combine the base model with the adapter?
- Also, when I try to load the saved adapter from local files using:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("path_to_adapter_file", local_files_only=True, device_map=device)

the full 16 GB of CUDA memory gets used. I don't know the reason, even though the adapter file is only 1.72 GB.
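For clarity, the merge and reload steps I attempt look roughly like this (a simplified sketch; the dtype and variable names are placeholders rather than my exact script):

```python
from peft import PeftModel

# Reload the base model (fp16 here just for the sketch) and match the resized vocab
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
base.resize_token_embeddings(len(tokenizer))  # vocab size 48065 after adding <pad>

# Attach the saved LoRA adapter
peft_model = PeftModel.from_pretrained(base, "path_to_adapter_file")

# Folding the adapter into the base weights is where the T4 runs out of CUDA memory
merged = peft_model.merge_and_unload()
```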
Am I making a mistake somewhere? Is there any solution to this? Thanks.