RPC failed with status = "Unavailable: Socket closed" Error when training FairSeq RoBERTa on Cloud TPU using PyTorch


I followed the tutorial "Pre-training FairSeq RoBERTa on Cloud TPU using Pytorch" to set up a preemptible (v2-8) TPU environment and train my RoBERTa model. The PyTorch environment is based on torch-xla-1.6, as the document instructs. However, unlike on GPU, the job does not print any training log, and it has thrown the RPC failure warning shown below twice in 2-3 days, about 12 hours apart (the network endpoint is redacted).

My training run has 161,529 steps per epoch. According to the document, a v2-8 should take 80 hours for the 5 epochs I configured. However, my job seems to be hanging.

Any advice, please?

 W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

There is 1 best solution below


It sounds like in this case your TPU may have been getting preempted. Please try using a non-preemptible TPU.
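
To check whether the warnings line up with preemptions, it may help to pull the timestamp and gRPC status out of each warning and compare them with when the TPU went down. The snippet below is only a rough sketch: `parse_grpc_warning` is a hypothetical helper name, and it assumes the warning has the same shape as the line quoted in the question, with `created` holding Unix epoch seconds. gRPC status 14 is UNAVAILABLE, which is what you would expect to see if the TPU node went away.

```python
import json
import re
from datetime import datetime, timezone

# The warning from the question, copied verbatim (endpoint still redacted).
LOG_LINE = r'''W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC'''


def parse_grpc_warning(line):
    """Hypothetical helper: extract time, status and message from one warning.

    Returns None if the line does not contain a grpc_error_string payload.
    """
    # The payload is a JSON object wrapped in double quotes.
    match = re.search(r'grpc_error_string = "(\{.*\})"', line)
    if match is None:
        return None
    detail = json.loads(match.group(1))
    # Assumption: "created" is "@<seconds since the Unix epoch>".
    created = float(detail["created"].lstrip("@"))
    return {
        "time_utc": datetime.fromtimestamp(created, tz=timezone.utc),
        "grpc_status": detail["grpc_status"],
        "grpc_message": detail["grpc_message"],
    }


if __name__ == "__main__":
    info = parse_grpc_warning(LOG_LINE)
    print(info["time_utc"].isoformat(), info["grpc_status"], info["grpc_message"])
```

If the times of the two warnings match the times the preemptible TPU was reclaimed, that would support the preemption explanation; if they do not, the cause is probably something else.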