I'm a beginner asking for guidance on training my first LLM with Hugging Face.
I'm trying to fine-tune DeBERTa through Hugging Face to classify text into 42 categories. To do so, I scraped, cleaned, and labelled a dataset from Wikipedia, which resulted in roughly 900k training, 100k validation, and 100k test examples with the features ['paragraph', 'label']. I tokenized the paragraphs appropriately and saved all the sets to disk. I load them in my model-training script and they work fine.
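For context, the preprocessing looked roughly like this (a simplified sketch: the tiny in-memory dataset, the max_length of 256, and the fixed-length padding are only illustrative choices, not necessarily what my real script uses):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')

# Tiny stand-in for the real scraped data; the real sets have the same
# two columns: 'paragraph' (raw text) and 'label' (integer class id).
raw = Dataset.from_dict({
    'paragraph': ['Photosynthesis converts light into chemical energy.'],
    'label': [4],  # 4 maps to 'Biology' in my label map
})

def tokenize(batch):
    # Pad/truncate every paragraph to a fixed length
    return tokenizer(batch['paragraph'], truncation=True, padding='max_length', max_length=256)

# The Trainer's default data collator maps the 'label' column to 'labels'
tokenized = raw.map(tokenize, batched=True)
tokenized.save_to_disk('datasets/train_set.hf')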
The trouble started when I tried to train the model. Below is my code:
#!/usr/bin/env python
# coding: utf-8
import os

from datasets import load_from_disk
from transformers import (
    DebertaForSequenceClassification,  # DebertaPreTrainedModel
    Trainer,
    TrainingArguments,
)


def main():
    data_path = '/Users/giacomosaracchi/Downloads/question_to_topics/subtopic_labelling/model/datasets'

    # Load datasets - all already tokenized
    print('1. Loading datasets...')
    train_set = load_from_disk(f'{data_path}/train_set.hf')
    val_set = load_from_disk(f'{data_path}/val_set.hf')

    # Initialize and configure the model
    print('2. Initializing model...')
    model = DebertaForSequenceClassification.from_pretrained('microsoft/deberta-base', num_labels=42)

    label_to_int = {
        'Agricultural Sciences': 0, 'Anthropology': 1, 'Architecture': 2,
        'Art': 3, 'Biology': 4, 'Business': 5, 'Chemistry': 6, 'Computer Science': 7,
        'Dance': 8, 'Design': 9, 'Earth Sciences': 10, 'Economics': 11, 'Education': 12,
        'Engineering': 13, 'Ethics': 14, 'Film': 15, 'Finance': 16, 'Geography': 17,
        'History': 18, 'Law': 19, 'Linguistics': 20, 'Literature': 21, 'Logic': 22,
        'Materials Science': 23, 'Mathematics': 24, 'Medicine': 25, 'Military Science': 26,
        'Music': 27, 'Philosophy': 28, 'Physics': 29, 'Politics': 30, 'Psychology': 31,
        'Religion': 32, 'Skills & Qualities': 33, 'Sociology': 34, 'Space Sciences': 35,
        'Sports & Health': 36, 'Systems Science': 37, 'Television': 38, 'Theatre': 39,
        'Theology': 40, 'Transportation': 41
    }

    # Store the label-int-to-label-str map in the loaded model's config
    id2label = {label_int: label_str for label_str, label_int in label_to_int.items()}
    model.config.update({"id2label": id2label})

    # Hub and output locations
    hf_token = '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'
    local_repo = './trainer_outputs'
    hf_repo = 'bright-est-2021/bright-deberta-classifier'

    # Training arguments & Trainer
    print('3. Initializing training arguments and trainer...')
    training_args = TrainingArguments(
        output_dir=local_repo,
        num_train_epochs=5,
        # per_device_train_batch_size=8,
        # per_device_eval_batch_size=8,
        evaluation_strategy="epoch",
        logging_dir=f"{local_repo}/logs",
        logging_strategy="steps",
        logging_steps=10,
        learning_rate=5e-5,
        weight_decay=0.01,
        warmup_steps=500,
        save_strategy="epoch",
        load_best_model_at_end=True,
        save_total_limit=2,
        report_to="tensorboard",
        push_to_hub=True,
        hub_strategy="every_save",
        hub_model_id=hf_repo,
        hub_token=hf_token,
        # dataloader_num_workers=os.cpu_count(),
        # ddp_backend='gloo'
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_set,
        eval_dataset=val_set,
    )

    # Fine-tune the model
    print('4. Starting training...')
    trainer.train()

    # Evaluate the model
    print('5. Evaluating model...')
    eval_results = trainer.evaluate()
    print(eval_results)

    print('6. Pushing model to hub...')
    trainer.push_to_hub()


if __name__ == '__main__':
    main()
To use all the available resources on my machine (8 CPUs), I'd like the training to run in parallel. Once it works on my machine, I'll transfer it to a cloud VM to speed up the training. To do that, I tried these two options:
- The dataloader_num_workers parameter in the TrainingArguments class, set equal to the number of CPUs on the machine running the code (see the sketch after this list).
- The accelerate library's shell commands: accelerate config to pass all the relevant parameters, then accelerate launch script.py.
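In practice, option 1 just means uncommenting the dataloader_num_workers line that is already in the script above; isolated, it looks like this (a minimal sketch showing only the arguments relevant to this point):

import os

from transformers import TrainingArguments

# Only the arguments relevant to option 1 are shown; the full set is in
# the script above.
training_args = TrainingArguments(
    output_dir='./trainer_outputs',
    dataloader_num_workers=os.cpu_count(),  # 8 on my machine
)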
Below is the printout of my accelerate env:
- `Accelerate` version: 0.24.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.18
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.1.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 16.00 GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_CPU
- mixed_precision: bf16
- use_cpu: True
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- ipex_config: {'ipex': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
And here are the packages I'm using:
datasets==2.12.0
numpy==1.24.3
pandas==1.5.3
scikit_learn==1.3.2
torch==2.1.0
tqdm==4.65.0
transformers==4.32.1
accelerate==0.26.1
tensorboardX==2.6.2.2
I tried all combinations of the above options, because I don't fully understand how they interact, and observed the outcome in my shell and in Activity Monitor:
- Without specifying dataloader_num_workers, only one process is generated, regardless of whether I run the script with accelerate or plain python.
- When specifying dataloader_num_workers and running with python, the training begins (a progress bar appears) and 8 processes spin up in Activity Monitor; then 7 of them crash with no error and 1 survives and continues the training. The CPU load moves accordingly, maxing out while all processes are alive and dropping to a tiny amount when only 1 is left.
- When specifying dataloader_num_workers and running with accelerate, the print statements in the code are repeated 8 times each, and then the same problem occurs: 8 processes spin up and 7 die.
Again, the goal is to fix this problem and use all the available resources on my machine, and then move the process to the cloud and leverage GPUs to get this done faster.
Thanks so much for your help!!!