Facing CUDA Error while training MNMT model using fairseq


I am trying to use fairseq to train a multilingual model on English-Russian, English-French, English-Spanish, and English-German data, but I keep getting a CUDA error that prevents training from running. I have tried several batch sizes and learning rates, but the error persists.

fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 4096 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0 --batch-size 64

The above is the command I have been running, with various values for the batch size, learning rate, etc., but every run ends with the same CUDA error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.57 GiB (GPU 0; 15.74 GiB total capacity; 5.29 GiB already allocated; 9.50 GiB free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
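
As far as I understand, the max_split_size_mb setting mentioned at the end of the message would be passed through an environment variable, something like this (128 is just an example value), but I am not sure it addresses the actual problem:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
fairseq-train pre ... # same arguments as above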

Any kind of help would be appreciated.

1 Answer

The preferred way of specifying the batch size in FairSeq is via the --max-tokens argument, not --batch-size (I am not sure what happens when both are given, as in your command).

All sentences in a batch are padded to the length of the longest one, and sentence lengths can vary a lot, so even a single very long sentence makes the entire batch very large. To avoid this, the --max-tokens argument was introduced. You set it to 4096, meaning a batch will not exceed 4096 tokens (including padding), but the number of sentences per batch may differ. It is implemented efficiently by first sorting the training sentences by length, then splitting them into batches, and finally shuffling the batches randomly. This maximizes memory efficiency.

What you should do is:

  1. Remove the --batch-size argument.
  2. Try to decrease the --max-tokens value (see the adjusted command below).
  3. If it still does not help, use a smaller model.
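
For example, here is your command with --batch-size removed and --max-tokens halved to 2048 (that value is just a starting point; adjust it until training fits into your 16 GiB of GPU memory):

fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 2048 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0

Note that you already use --update-freq 8, i.e. gradients are accumulated over 8 batches before each update, so the effective batch size is up to --max-tokens × 8 tokens per GPU. If you lower --max-tokens and want to keep the effective batch size, you can raise --update-freq accordingly.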

The learning rate, by the way, has no effect on memory consumption, so changing it will not make this error go away.