Error in training YOLOv3 model with custom dataset using Gluon CV: TBlob.get_with_shape mismatch

I'm currently training a YOLOv3 model on a custom dataset using the script from Gluon CV's GitHub repository. Here's the link to the script: https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/yolo/train_yolo3.py.

During training, I hit this error at [Epoch 4][Batch 1199]. Since the error sometimes appears in different epochs (1, 2, or 4), I suspect there is an issue with a particular image or set of images in my dataset. I'm also using a fixed YOLO3DefaultTrainTransform of 416 x 416 (width x height) in the train_loader, roughly as sketched below.
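
This is essentially the loader setup from the script; the dataset class, class list, paths, batch size, and worker count below are placeholders for my custom setup:

    from mxnet import gluon
    from gluoncv import data as gdata, model_zoo
    from gluoncv.data.batchify import Tuple, Stack, Pad
    from gluoncv.data.transforms.presets.yolo import YOLO3DefaultTrainTransform

    width, height = 416, 416                # fixed input size for the transform
    my_classes = ['obj']                    # placeholder class list for my dataset
    net = model_zoo.get_model('yolo3_darknet53_custom', classes=my_classes,
                              pretrained_base=True)
    train_dataset = gdata.LstDetection('train.lst', root='images/')  # placeholder dataset

    # stack the image and the five target arrays, pad the raw ground-truth boxes
    batchify_fn = Tuple(*([Stack() for _ in range(6)] + [Pad(axis=0, pad_val=-1)]))
    train_loader = gluon.data.DataLoader(
        train_dataset.transform(YOLO3DefaultTrainTransform(width, height, net)),
        batch_size=16, shuffle=True, batchify_fn=batchify_fn,
        last_batch='rollover', num_workers=8)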

The error message I received is:

Error while training model:
 ...DeepMXNet-1.5.x.36280.0/.../src/include/mxnet/././tensor_blob.h:290: 
Check failed: 
this->shape_.Size() == static_cast<size_t>(shape.Size()) 
(2884488192 vs. 18446744072299072512) : 
TBlob.get_with_shape: new and old shape do not match total elements

The shape numbers also change between runs. In this case the shape.Size() values are very large, but I have also seen smaller numbers such as (4758330828 vs 463363532).

The relevant portion of the traceback points towards the batchify function:

Traceback (most recent call last):
  File ".../python3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "...mxnet/gluon/data/dataloader.py", line 400, in _worker_fn
    batch = batchify_fn([_worker_dataset[i] for i in samples])
  File "...mxnet/gluon/data/dataloader.py", line 400, in <listcomp>
    batch = batchify_fn([_worker_dataset[i] for i in samples])
  File "...mxnet/gluon/data/dataset.py", line 124, in __getitem__
    return self._fn(*item)
  File ".../lib/python3.8/site-packages/gluoncv/data/transforms/presets/yolo.py", line 195, in __call__
    objectness, center_targets, scale_targets, weights, class_targets = self._target_generator(
  File "...mxnet/gluon/block.py", line 548, in __call__
    out = self.forward(*args)
  File ".../lib/python3.8/site-packages/gluoncv/model_zoo/yolo/yolo_target.py", line 94, in forward
    matches = ious.argmax(axis=1).asnumpy()  # (B, M)
  File "...mxnet/ndarray/ndarray.py", line 1993, in asnumpy
    check_call(_LIB.MXNDArraySyncCopyToCPU(
  File "...mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))

Environment details:

  • python version: 3.8
  • mxnet version: 1.5
  • gluoncv version: 0.5

Has anyone faced a similar issue, or can anyone provide insight into what might be causing this error and how to address it? Would providing additional details about my dataset or specific images help?

I have tried reducing the batch size from 64 to 16. The only effect is that instead of failing at epoch 0, it fails at later epochs (2, 3, or 4).
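
Since I still suspect a specific image or label, my next step is to walk the transformed dataset in a single process (no workers) to find the exact index that triggers the failure. This is only a rough sketch of what I have in mind, reusing the train_dataset and net placeholders from the snippet above:

    import traceback
    from gluoncv.data.transforms.presets.yolo import YOLO3DefaultTrainTransform

    # apply the same transform the DataLoader workers use, one sample at a time
    transformed = train_dataset.transform(YOLO3DefaultTrainTransform(416, 416, net))
    for idx in range(len(transformed)):
        try:
            sample = transformed[idx]      # resize/augment + target generation
            for item in sample:            # force any pending NDArray work to finish here
                if hasattr(item, 'wait_to_read'):
                    item.wait_to_read()
        except Exception:
            print('Transform failed at dataset index', idx)
            traceback.print_exc()
            break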
