I'm a beginner asking for guidance on training my first LLM with Hugging Face.
I'm trying to fine-tune DeBERTa through Hugging Face to classify text into 42 categories. To do so, I scraped, cleaned, and labelled a dataset from Wikipedia, which resulted in roughly 900k training, 100k validation, and 100k test examples with the features ['paragraph', 'label']. I tokenized the paragraphs appropriately and saved all the sets to disk. I load them in my model-training script and they work fine.
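For context, the preprocessing looked roughly like this (a simplified sketch: the tiny in-memory dataset, the max_length of 256, and the fixed-length padding are only illustrative choices, not necessarily what my real script uses):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')

# Tiny stand-in for the real scraped data; the real sets have the same
# two columns: 'paragraph' (raw text) and 'label' (integer class id).
raw = Dataset.from_dict({
    'paragraph': ['Photosynthesis converts light into chemical energy.'],
    'label': [4],  # 4 maps to 'Biology' in my label map
})

def tokenize(batch):
    # Pad/truncate every paragraph to a fixed length
    return tokenizer(batch['paragraph'], truncation=True, padding='max_length', max_length=256)

# The Trainer's default data collator maps the 'label' column to 'labels'
tokenized = raw.map(tokenize, batched=True)
tokenized.save_to_disk('datasets/train_set.hf')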
The trouble started when I tried to train the model. Below is my code:
#!/usr/bin/env python
# coding: utf-8
import os

from datasets import load_from_disk
from transformers import (
    DebertaForSequenceClassification,  # DebertaPreTrainedModel
    Trainer,
    TrainingArguments,
)


def main():
    data_path = '/Users/giacomosaracchi/Downloads/question_to_topics/subtopic_labelling/model/datasets'

    # Load datasets - all already tokenized
    print('1. Loading datasets...')
    train_set = load_from_disk(f'{data_path}/train_set.hf')
    val_set = load_from_disk(f'{data_path}/val_set.hf')

    # Initialize and configure the model
    print('2. Initializing model...')
    model = DebertaForSequenceClassification.from_pretrained('microsoft/deberta-base', num_labels=42)

    label_to_int = {
        'Agricultural Sciences': 0, 'Anthropology': 1, 'Architecture': 2,
        'Art': 3, 'Biology': 4, 'Business': 5, 'Chemistry': 6, 'Computer Science': 7,
        'Dance': 8, 'Design': 9, 'Earth Sciences': 10, 'Economics': 11, 'Education': 12,
        'Engineering': 13, 'Ethics': 14, 'Film': 15, 'Finance': 16, 'Geography': 17,
        'History': 18, 'Law': 19, 'Linguistics': 20, 'Literature': 21, 'Logic': 22,
        'Materials Science': 23, 'Mathematics': 24, 'Medicine': 25, 'Military Science': 26,
        'Music': 27, 'Philosophy': 28, 'Physics': 29, 'Politics': 30, 'Psychology': 31,
        'Religion': 32, 'Skills & Qualities': 33, 'Sociology': 34, 'Space Sciences': 35,
        'Sports & Health': 36, 'Systems Science': 37, 'Television': 38, 'Theatre': 39,
        'Theology': 40, 'Transportation': 41
    }

    # Store the label-int-to-label-str map in the loaded model's config
    id2label = {label_int: label_str for label_str, label_int in label_to_int.items()}
    model.config.update({"id2label": id2label})

    # Hub and output locations
    hf_token = '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'
    local_repo = './trainer_outputs'
    hf_repo = 'bright-est-2021/bright-deberta-classifier'

    # Training arguments & Trainer
    print('3. Initializing training arguments and trainer...')
    training_args = TrainingArguments(
        output_dir=local_repo,
        num_train_epochs=5,
        # per_device_train_batch_size=8,
        # per_device_eval_batch_size=8,
        evaluation_strategy="epoch",
        logging_dir=f"{local_repo}/logs",
        logging_strategy="steps",
        logging_steps=10,
        learning_rate=5e-5,
        weight_decay=0.01,
        warmup_steps=500,
        save_strategy="epoch",
        load_best_model_at_end=True,
        save_total_limit=2,
        report_to="tensorboard",
        push_to_hub=True,
        hub_strategy="every_save",
        hub_model_id=hf_repo,
        hub_token=hf_token,
        # dataloader_num_workers=os.cpu_count(),
        # ddp_backend='gloo'
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_set,
        eval_dataset=val_set,
    )

    # Fine-tune the model
    print('4. Starting training...')
    trainer.train()

    # Evaluate the model
    print('5. Evaluating model...')
    eval_results = trainer.evaluate()
    print(eval_results)

    print('6. Pushing model to hub...')
    trainer.push_to_hub()


if __name__ == '__main__':
    main()
To use all the available resources on my machine (8 CPUs), I'd like the training to run in parallel. Once it works on my machine, I'll transfer it to a cloud VM to speed up the training. To do that, I tried these two options:
- The dataloader_num_workers parameter in the TrainingArguments class, set equal to the number of CPUs on the machine running the code (see the sketch after this list).
- The accelerate library's shell commands: accelerate config to pass all the relevant parameters, then accelerate launch script.py.
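In practice, option 1 just means uncommenting the dataloader_num_workers line that is already in the script above; isolated, it looks like this (a minimal sketch showing only the arguments relevant to this point):

import os

from transformers import TrainingArguments

# Only the arguments relevant to option 1 are shown; the full set is in
# the script above.
training_args = TrainingArguments(
    output_dir='./trainer_outputs',
    dataloader_num_workers=os.cpu_count(),  # 8 on my machine
)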
Below is the printout of my accelerate env:
- `Accelerate` version: 0.24.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.18
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.1.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 16.00 GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_CPU
- mixed_precision: bf16
- use_cpu: True
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- ipex_config: {'ipex': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
And here are the packages I'm using:
datasets==2.12.0
numpy==1.24.3
pandas==1.5.3
scikit_learn==1.3.2
torch==2.1.0
tqdm==4.65.0
transformers==4.32.1
accelerate==0.26.1
tensorboardX==2.6.2.2
I tried all combinations of the above options, because I don't fully understand how they interact, and observed the outcome in my shell and in Activity Monitor:
- Without specifying dataloader_num_workers, only one process is generated, regardless of whether I run the script with accelerate or plain python.
- When specifying dataloader_num_workers and running with python, the training begins (a progress bar appears) and 8 processes spin up in Activity Monitor; then 7 of them crash with no error and 1 survives and continues the training. The CPU load moves accordingly, maxing out while all processes are alive and dropping to a tiny amount when only 1 is left.
- When specifying dataloader_num_workers and running with accelerate, the print statements in the code are repeated 8 times each, and then the same problem occurs: 8 processes spin up and 7 die.
Again, the goal is to fix this problem and use all the available resources on my machine, and then move the process to the cloud and leverage GPUs to get this done faster.
Thanks so much for your help!!!