ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

333 Views Asked by MAC At 05 June 2025 at 22:17

I am running distributed training on the GCP Vertex AI platform. The model is trained in parallel on 4 GPUs using PyTorch and Hugging Face. After training, when I copy the saved model from the local container to a GCS bucket, the upload fails with the error shown below.

Here is the code:

I launch train.py this way:

python -m torch.distributed.launch --nproc_per_node 4  train.py

After training is complete, I save the model files as follows. There are 3 files that need to be saved.

trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0  cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP

Error:

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded

And sometimes I get this error:

ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.


There are 2 best solutions below

Sandeep Vokkareni On 18 April 2022 at 07:28

As per the documentation on name conflicts, you are trying to overwrite a file that has already been created.

So I would recommend changing the destination location to include a unique identifier per training run, so that you don't receive this type of error. For example, append a timestamp in string format to the end of your bucket path, like:

- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
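A minimal sketch of building such a destination in Python. The helper name `unique_destination` and the bucket path `gs://my-bucket/model_mlm` are hypothetical placeholders; the timestamp format follows the example above.

```python
from datetime import datetime, timezone


def unique_destination(base_uri, now=None):
    """Append a UTC timestamp (YYYYMMDDHHMMSS) to base_uri as the last path segment."""
    now = now or datetime.now(timezone.utc)
    return f"{base_uri.rstrip('/')}/{now.strftime('%Y%m%d%H%M%S')}"


if __name__ == "__main__":
    # "gs://my-bucket/model_mlm" is a placeholder, not a real bucket.
    print(unique_destination("gs://my-bucket/model_mlm"))
```

In a multi-process run, the value would need to be computed once and shared, since each process calling the helper separately may get a different timestamp.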

I would also like to mention that this kind of error is retryable, as noted in the error documentation.

blake On 25 April 2023 at 11:46

I encountered this issue as well. It appears that this happens when the file contents change whilst rsync is uploading the file. This can happen for large files since file writes are not guaranteed to be transactional.
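One way a file can change mid-upload in a setup like the question's is that several of the launched processes reach the save-and-copy step at different times. This is an assumption about the cause, not something confirmed in this thread. A minimal sketch that restricts the copy to a single process, assuming the launcher exports a `RANK` environment variable (the helper name `is_main_process` is hypothetical):

```python
import os


def is_main_process(env=None):
    """Return True when RANK is 0 or unset (unset is treated as a single-process run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


# Usage sketch: only one process performs the upload.
# if is_main_process():
#     subprocess.call("gsutil cp -r <local_dir> <gs_uri>", shell=True)
```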

I got around the issue by simply retrying the gsutil rsync command.
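The retry can also be automated. Below is a rough sketch, not the exact approach used here: the helper name `run_with_retries`, the attempt count, and the delay are all illustrative assumptions, and the paths in the usage comment are placeholders.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=5.0, runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or attempts run out; return the last exit code."""
    returncode = None
    for attempt in range(1, attempts + 1):
        returncode = runner(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Wait before the next attempt; no wait after the final one.
            sleep(delay_s)
    return returncode


# Usage sketch (placeholder paths):
# run_with_retries(["gsutil", "rsync", "-r", "/path/to/model_mlm", "gs://my-bucket/model_mlm"])
```

The `runner` and `sleep` parameters exist only so the helper can be exercised without calling gsutil or actually waiting.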
