Can't run fine-tuning for Llama 7B with LoRA (OOM)


I am trying to fine-tune Llama 7B using LoRA + DeepSpeed and I hit OOM every time (I have an RTX 3090 with 24 GB VRAM, 32 GB RAM, and 80 GB swap). It looks like nothing is being offloaded to CPU/RAM. I have tried different parameters with different values, but it seems it simply doesn't offload and tries to run everything on the GPU alone. I have spent a couple of days trying to get it to run, please help.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 23.47 GiB of which 100.38 MiB is free. Including non-PyTorch memory, this process has 21.66 GiB memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, and 11.11 MiB is reserved by PyTorch but unallocated.

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_name, project_tag):
    # loads the full-precision weights; nothing is quantized or offloaded here
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./llama")
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./llama")
....
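As far as I understand, prepare_model_for_int8_training (used further down) expects the model to already be loaded in 8-bit, whereas the loader above keeps the full-precision weights on the GPU. A rough sketch of what 8-bit loading would look like (assuming bitsandbytes is installed and my transformers version still accepts load_in_8bit directly):

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer_8bit(model_name, cache_dir="./llama"):
    # load the base weights quantized to int8 so the 7B model needs far less VRAM
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        cache_dir=cache_dir,
        load_in_8bit=True,    # requires bitsandbytes
        device_map="auto",    # let accelerate place layers across GPU/CPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    return model, tokenizer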

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

loraConfig = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

project_tag = "<projectX>"
model_name = "meta-llama/Llama-2-7b-chat-hf"
model, tokenizer = load_model_and_tokenizer(model_name, project_tag)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, loraConfig)
file_path = "encodedProjectData.txt"
unlabeled_texts = read_data_from_file(file_path)
unlabeled_dataset = CustomUnlabeledDataset(unlabeled_texts, tokenizer)

from transformers import TrainingArguments, Trainer

training_arguments = TrainingArguments(
    auto_find_batch_size=True,
    optim="adafactor",
    gradient_checkpointing=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=60,
    learning_rate=2e-4,
    evaluation_strategy="steps",
    eval_accumulation_steps=1,
    eval_steps=10,
    seed=42,
    # report_to="wandb",
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
)

# Wrap the model with DeepSpeed (deepspeed_config is the dict shown further down)
import deepspeed

model, optimizer, _, _ = deepspeed.initialize(model=model, args=training_arguments, config=deepspeed_config)

trainer = Trainer(
    model=model,
    train_dataset=unlabeled_dataset,
    args=training_arguments,
)
trainer.train()
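From what I have read, the Transformers Trainer can also drive DeepSpeed itself when the config is passed through TrainingArguments, instead of wrapping the model with deepspeed.initialize manually and then handing the wrapped model to Trainer. A minimal sketch of that wiring (using the deepspeed_config dict shown below; untested on my exact setup):

training_arguments = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_steps=60,
    deepspeed=deepspeed_config,  # Trainer creates and manages the DeepSpeed engine
    # note: fp16/bf16 flags here have to agree with the "fp16"/"bf16" sections of the config
)

trainer = Trainer(
    model=model,                     # plain PEFT model, not wrapped by deepspeed.initialize
    train_dataset=unlabeled_dataset,
    args=training_arguments,
)
trainer.train()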

This is the deepspeed_config I pass in:

deepspeed_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto"
    },
    "steps_per_print": 2000,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
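If I read the ZeRO docs correctly, the config above only offloads parameters; optimizer state stays on the GPU unless an offload_optimizer block is added as well. A sketch of the extended zero_optimization section (untested):

    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # additionally push optimizer state and the update step to CPU
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto"
    },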

I have tried different TrainingArguments and different setups to run the fine-tuning, but I always end up with the same OOM error.
