I am running into a strange bug: my submitted custom job fails because it cannot find the bucket I defined for the training output, although I can see that the bucket exists. This is the error I am getting:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 508, in <module>
    train_loss = train(scheduler, optimizer)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 390, in train
    torch.save(model.state_dict(), args.model_dir)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 369, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'gs://machine-learning-us-central1/test_04_30_15_12'
The bucket and directory named in the error, gs://machine-learning-us-central1/test_04_30_15_12, do exist. They are also created before the actual training starts.
It works when I set the values in Python code rather than passing them as command-line arguments. I therefore assume the bug is in the parser I use for the command-line arguments. Parser:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    help='Model dir.')
parser.add_argument('--model-name', dest='model_name',
                    help='Name of the model',
                    default='model.pt')
args = parser.parse_args()
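One way to rule the parser in or out is to call parse_args with an explicit argument list and look at what comes back. This is a minimal sketch, assuming the job passes --model-dir with the gs:// path from the error. The argument list below is my assumption, not copied from the job spec, and build_parser is just a wrapper I added for the check:

```python
import argparse


def build_parser():
    """Same two arguments as in my trainer, wrapped in a function for testing."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', dest='model_dir',
                        help='Model dir.')
    parser.add_argument('--model-name', dest='model_name',
                        help='Name of the model',
                        default='model.pt')
    return parser


# Hypothetical argument list standing in for what the custom job passes.
args = build_parser().parse_args(
    ['--model-dir', 'gs://machine-learning-us-central1/test_04_30_15_12'])

print(repr(args.model_dir))   # value received for --model-dir
print(repr(args.model_name))  # falls back to the default when not passed
```

If the printed model_dir is the full gs:// string, the parser is passing the value through unchanged, and whatever consumes args.model_dir receives a URI rather than a local path.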
How I store the training output:
from google.cloud import storage

def save_model(args_dir, args_name):
    """Saves the model to Google Cloud Storage

    Args:
      args_dir: local path of the file to upload.
      args_name: name of the destination blob in ROOT_BUCKET.
    """
    bucket = storage.Client().bucket(ROOT_BUCKET)
    blob = bucket.blob(args_name)
    blob.upload_from_filename(args_dir)

# Within the training run
torch.save(model.state_dict(), args.model_name)
save_model(args.model_name, args.model_dir)
The model dir seems to be defined correctly, since I can store files in it, just not when the value comes from the parser.
Or could the issue be with PyTorch and how I save the model?
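On the PyTorch question: the traceback shows torch.save handing the path to Python's built-in open(), which treats a gs:// string as an ordinary local path. Below is a minimal sketch of one possible workaround, assuming the model is first written to a local temporary file and then uploaded with google-cloud-storage. split_gcs_uri and save_and_upload are hypothetical helpers, not part of my trainer, and the third-party imports are deferred so the URI parsing can be tried on its own:

```python
import os
import tempfile
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != 'gs' or not parsed.netloc:
        raise ValueError(f'not a gs:// URI: {uri!r}')
    return parsed.netloc, parsed.path.strip('/')


def save_and_upload(state_dict, model_dir, model_name):
    """Hypothetical helper: save locally, then upload to model_dir/model_name."""
    # Deferred imports: only needed when actually saving and uploading.
    import torch
    from google.cloud import storage

    bucket_name, prefix = split_gcs_uri(model_dir)
    blob_name = f'{prefix}/{model_name}' if prefix else model_name

    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, model_name)
        # torch.save gets a real local path, so open() can create the file.
        torch.save(state_dict, local_path)
        bucket = storage.Client().bucket(bucket_name)
        bucket.blob(blob_name).upload_from_filename(local_path)
    return f'gs://{bucket_name}/{blob_name}'


# Only the URI parsing runs here; no credentials or network access needed.
bucket_name, prefix = split_gcs_uri(
    'gs://machine-learning-us-central1/test_04_30_15_12')
print(bucket_name, prefix)
```

If this reading is right, a call like save_and_upload(model.state_dict(), args.model_dir, args.model_name) would stand in for the direct torch.save(model.state_dict(), args.model_dir) seen in the traceback.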