I am trying to use fairseq to train a multilingual model on English-Russian, English-French, English-Spanish, and English-German data, but I keep getting a CUDA error that prevents the model from running. I have tried multiple batch sizes and learning rates, but I am still unable to train.
fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 4096 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0 --batch-size 64
The above is the command I have used, with various values for batch size, learning rate, etc., but every run ends in the same CUDA error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.57 GiB (GPU 0; 15.74 GiB total capacity; 5.29 GiB already allocated; 9.50 GiB free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any kind of help would be appreciated.
The preferred way of specifying the batch size in FairSeq is via the --max-tokens argument, not --batch-size (I am not sure what happens if you specify both). The batches are always padded to the same length, and the sentence lengths might be vastly different, so if there is even a single very long sentence in a batch, the entire batch becomes very large. To avoid this, the --max-tokens argument was introduced. It is set to 4096 here, meaning a batch will not exceed 4096 tokens, but the number of sentences in each batch might differ. It is implemented efficiently by sorting the training sentences by their length first, then splitting them into batches, which are then shuffled randomly. This maximizes memory efficiency.

What you should do is:
1. Remove the --batch-size argument.
2. Decrease the --max-tokens argument until the batches fit into your GPU memory.

The learning rate has no effect on memory consumption.
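For example, the adjusted command might look like the following. The value 2048 is only an illustrative starting point, not a known-good setting; keep lowering --max-tokens until training fits into your GPU memory.

# --batch-size 64 removed; --max-tokens lowered (2048 is just an example value, tune it for your GPU)
fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 2048 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0

Note that --update-freq 8 already accumulates gradients over 8 batches before each optimizer step, so the effective batch size remains reasonably large even with a smaller --max-tokens.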