I am trying to run some code from GitHub. The file is called train.py. It is supposed to train a neural network on a dataset. However, I get the following error:
(QGN) ubuntu@ip-172-31-13-114:~/QGN$ python train.py
Input arguments:
id ade20k
arch_encoder resnet50
arch_decoder QGN_dense_resnet34
weights_encoder
weights_decoder
fc_dim 2048
list_train ./data/train_ade20k.odgt
list_val ./data/validation_ade20k.odgt
root_dataset ./data/
num_gpus 0
batch_size_per_gpu 2
num_epoch 20
start_epoch 1
epoch_iters 5000
optim SGD
lr_encoder 0.02
lr_decoder 0.02
lr_pow 0.9
beta1 0.9
weight_decay 0.0001
deep_sup_scale 1.0
prop_weight 2.0
enhance_weight 2.0
fix_bn 0
num_val 500
num_class 150
transform_dict None
workers 40
imgSize [300, 375, 450, 525, 600]
imgMaxSize 1000
cropSize 0
padding_constant 32
random_flip True
seed 1337
ckpt ./ckpt
disp_iter 20
visualize False
result ./result
gpu_id 0
Model ID: ade20k-resnet50-QGN_dense_resnet34-batchSize0-LR_encoder0.02-LR_decoder0.02-epoch20-lossScale1.0-classScale2.0
# samples: 20210
1 Epoch = 5000 iters
Starting Training!
Traceback (most recent call last):
File "train.py", line 355, in <module>
main(args)
File "train.py", line 217, in main
train(segmentation_module, iterator_train, optimizers, history, epoch, args)
File "train.py", line 33, in train
batch_data = next(iterator)
File "/home/ubuntu/QGN/lib/utils/data/dataloader.py", line 274, in __next__
raise StopIteration
StopIteration
Segmentation fault (core dumped)
The code from train.py (lines 211 to 231) is as follows '''
Main loop
history = {'train': {'epoch': [], 'loss': [], 'acc': []}}
print('Starting Training!')
for epoch in range(args.start_epoch, args.num_epoch + 1):
    train(segmentation_module, iterator_train, optimizers, history, epoch, args)
    # checkpointing
    checkpoint(nets, history, args, epoch)
    # evaluation
    args.weights_encoder = os.path.join(args.ckpt, 'encoder_epoch_' + str(epoch) + '.pth')
    args.weights_decoder = os.path.join(args.ckpt, 'decoder_epoch_' + str(epoch) + '.pth')
    iou = eval_train(args)
    # adaptive class weighting
    adjust_crit_weights(segmentation_module, iou, args)
print('Training Done!')
'''
I am not sure if I have shared all the required information. I would appreciate any help in resolving this issue. For reference, I have tried the try/except approach shared on GitHub at https://github.com/amdegroot/ssd.pytorch/issues/214. However, the error still persists.
The code from line 30 in train.py is as follows
# main loop
tic = time.time()
for i in range(args.epoch_iters):
    batch_data = next(iterator)
    data_time.update(time.time() - tic)
    segmentation_module.zero_grad()
I amended the above code as follows:
# main loop
loader_train = torchdata.DataLoader(
    dataset_train,
    batch_size=args.num_gpus,  # we have modified data_parallel
    shuffle=False,  # we do not use this param
    collate_fn=user_scattered_collate,
    num_workers=int(args.workers),
    drop_last=True,
    pin_memory=True)
tic = time.time()
for i in range(args.epoch_iters):
    try:
        batch_data = next(iterator)
    except StopIteration:
        iterator = iter(loader_train)
        batch_data = next(iterator)
    data_time.update(time.time() - tic)
    segmentation_module.zero_grad()
But still no joy. The error still remains.
TL;DR
Your args.epoch_iters is larger than the number of batches in loader_train. Python raises a StopIteration exception when you ask for more batches than there actually are.
When you iterate over a Python collection of elements (e.g., list, tuple, DataLoader...), Python needs to know when it reaches the end of that collection. It signals this by raising a StopIteration exception. A for loop in Python explicitly listens for this exception and uses it to know when to stop. Alas, in your code you do not use a for loop over loader_train, but rather over range(args.epoch_iters), and you use next(iterator) to get the batches.
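To make the behaviour described above concrete, here is a minimal sketch that uses a plain list as a stand-in for loader_train (an assumption, so that it runs without PyTorch or the dataset). The helper name infinite_batches is hypothetical and not part of the repository. It shows next() raising StopIteration past the end, and one possible way to keep drawing batches by restarting the loader. Note that restarting cannot help if the loader yields no batches at all, so the sketch raises a clear error in that case.

```python
def infinite_batches(loader):
    """Yield batches from `loader` forever, restarting it once it is exhausted.

    `infinite_batches` is a hypothetical helper, not part of train.py.
    """
    while True:
        got_any = False
        for batch in loader:  # the for loop handles StopIteration internally
            got_any = True
            yield batch
        if not got_any:
            # An empty loader would otherwise loop forever without yielding.
            raise ValueError("loader produced no batches")


# Stand-in for loader_train: any re-iterable object works, here 3 fake batches.
loader = ["batch0", "batch1", "batch2"]

# Calling next() past the end raises StopIteration, as in the traceback.
iterator = iter(loader)
for _ in range(3):
    next(iterator)
try:
    next(iterator)
    raised = False
except StopIteration:
    raised = True

# Asking for 7 batches from a 3-batch loader now simply wraps around.
batches = infinite_batches(loader)
seen = [next(batches) for _ in range(7)]
```

In train.py, the same idea would mean building the iterator from something like infinite_batches(loader_train) before the loop over range(args.epoch_iters), assuming loader_train actually produces at least one batch.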