GCP AI Platform (Unified): Cannot find bucket when using parser / command-line arguments


I am running into a strange bug where the submitted custom job fails because it cannot find the bucket I defined for the training output, even though I can see that the bucket exists. This is the error I am getting:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 508, in <module>
    train_loss = train(scheduler, optimizer)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 390, in train
    torch.save(model.state_dict(), args.model_dir)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 369, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'gs://machine-learning-us-central1/test_04_30_15_12'

The bucket and directory from the error, gs://machine-learning-us-central1/test_04_30_15_12, do exist, and they are created before the actual training starts.

Everything works when I set the paths in Python code instead of passing them as command-line arguments. I therefore assume there is a bug in the parser I use for the command-line arguments. The parser:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    help='Model dir.')
parser.add_argument('--model-name', dest='model_name',
                    help='Name of the model',
                    default='model.pt')
args = parser.parse_args()
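
For illustration, the module would then be invoked roughly like this (a hypothetical direct call using the path from the error above and the parser's default model name; the custom job passes the same flags through its arguments):

python -m trainer.task \
    --model-dir=gs://machine-learning-us-central1/test_04_30_15_12 \
    --model-name=model.pt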

How I store the training output:

from google.cloud import storage  # ROOT_BUCKET (the bucket name) is defined elsewhere in the module

def save_model(args_dir, args_name):
    """Saves the model to Google Cloud Storage.

    Args:
      args_dir: local path of the saved model file to upload.
      args_name: name of the blob to create in the bucket.
    """
    bucket = storage.Client().bucket(ROOT_BUCKET)
    blob = bucket.blob(args_name)
    blob.upload_from_filename(args_dir)

# within the training run
torch.save(model.state_dict(), args.model_name)
save_model(args.model_name, args.model_dir)
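
Since args.model_dir is a full gs:// URI while bucket.blob() expects only an object name, here is a minimal sketch of how such a URI could be split before handing it to the client (the helper name is hypothetical, and I am assuming model_dir is always a full gs:// URI):

def split_gcs_uri(uri):
    """Splits 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    assert uri.startswith("gs://"), uri
    bucket_name, _, object_path = uri[len("gs://"):].partition("/")
    return bucket_name, object_path

# Hypothetical values taken from the error message above:
# split_gcs_uri("gs://machine-learning-us-central1/test_04_30_15_12")
# returns ("machine-learning-us-central1", "test_04_30_15_12")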

The model dir itself seems to be defined correctly, because I can store files in it when the path is set in Python code; it only fails when the value comes from the parser.

Or could there be an issue with PyTorch and how I save the model?
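
For comparison, and judging from the traceback, torch.save falls back to Python's built-in open() when it is given a string path, so a gs:// URI is treated as a (non-existent) local path. A minimal sketch of that difference, with a hypothetical state dict:

import torch

state = {"weight": torch.zeros(1)}

# Works: a local path is opened with the built-in open() and written to disk.
torch.save(state, "model.pt")

# Fails with FileNotFoundError, as in the traceback above: a gs:// URI is not a
# local path, so open() cannot create the file.
# torch.save(state, "gs://machine-learning-us-central1/test_04_30_15_12")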
