I got NaN for all losses while training a YOLOv8 model


I am training a YOLOv8 model on CUDA using this code:

from ultralytics import YOLO
import torch
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the "OMP: Error #15" duplicate OpenMP runtime error
model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12)  # train on the custom dataset
results = model.val()  # evaluate on the validation set
model.export(format="onnx")  # export the trained model to ONNX

and I am getting NaN for all losses:

Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
1/15      1.74G        nan        nan        nan         51        640:   4%

I have tried training the model on CPU and it worked fine. The problem appeared when I installed CUDA and started training on the GPU.

I expected there to be an error reading the data or something similar, but everything else works fine.

I think it has something to do with memory: when I decreased the image size, the model trained fine, but when I increased the batch size at the same reduced image size it showed NaN again. So it seems to be a trade-off between image size, batch size, and memory. I am not 100% sure that is right, but it is what I figured out by experiment. If you have a better answer to this problem, please share it.
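
If memory really is the bottleneck, one way to check is to print how much memory the GPU has and how much PyTorch is actually using. A minimal sketch, assuming a single CUDA device at index 0:

import torch

# Report total GPU memory and how much PyTorch has allocated/reserved
# (assumes one CUDA device at index 0).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Total memory:         {props.total_memory / 1024**3:.2f} GiB")
    print(f"Allocated by PyTorch: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
    print(f"Reserved by PyTorch:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
else:
    print("CUDA is not available; training would run on CPU.")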


There are 4 answers below.

I had a similar issue but found that it went away when I upgraded to the most recent version of ultralytics. Everything was working in an environment with ultralytics 8.0.26, and then I saw the NaN loss issue in an environment with 8.0.30-something. Creating a new environment with ultralytics 8.0.42 seemed to solve the problem.
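
To see which version is installed before upgrading, a quick check (with the upgrade command assuming a pip-based install) is:

import ultralytics

# Print the installed ultralytics version; upgrade from the shell with
# "pip install -U ultralytics" if it is older than 8.0.42.
print(ultralytics.__version__)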

I was having the same problem trying to train on my custom dataset. As someone else here recommended, I also tried downgrading the ultralytics version to 8.0.42, but that didn't work. What did fix it was running the command below:

yolo task=detect mode=train model=yolov8s.pt data="./data/data.yaml" epochs=50 batch=8 imgsz=640 device=0 workers=8 optimizer=Adam pretrained=true dropout=true val=true plots=true half=true save=True show=true save_txt=true save_conf=true save_crop=true optimize=true lr0=0.001 lrf=0.01 fliplr=0.0

Try opening the args file (runs\detect\train\args.yaml) and changing the parameters based on what is available there or in the docs (https://docs.ultralytics.com/cfg/); at some point you may solve the problem. I believe the main parameter you should try changing is the device, setting it to "cpu".

I suspect that the problem may be with the GTX 16 series, as discussed here: https://github.com/ultralytics/ultralytics/issues/1148.
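
To confirm which GPU PyTorch actually sees (and whether it is a GTX 16-series card), a quick check is:

import torch

# List every CUDA device visible to PyTorch; a GTX 1650/1660 here
# would match the linked issue.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
else:
    print("No CUDA device visible to PyTorch.")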

I had the same issue. Even after upgrading ultralytics to its latest version (8.0.94) and setting the batch size to a lower value, it did not help. When I set the device to CPU (device=cpu), it works perfectly fine.

So the problem was mainly with the GPU. As suggested in the GitHub issue, setting amp=False fixed it and I was able to train on the GPU:

yolo task=detect mode=train model=yolov8s.pt data="data.yaml" epochs=20 batch=2 imgsz=640 device=0 workers=8 optimizer=Adam pretrained=true val=true plots=true save=True show=true optimize=true lr0=0.001 lrf=0.01 fliplr=0.0 amp=False
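
The equivalent with the Python API from the question would be roughly the following; amp is a standard train argument in recent ultralytics releases, and the other arguments are simply copied from the question:

from ultralytics import YOLO

# Same training call as in the question, but with automatic mixed
# precision (AMP) disabled to avoid NaN losses on the GPU.
model = YOLO("yolov8n.pt")
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12, amp=False)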

Set batch=2 and try again; I solved the problem this way.