Google AI Platform: The replica master 0 exited with a non-zero status of 127


There's a similar SO question: Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1

But here, I'm getting exit status 127 instead. As in that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I get the error message "The replica master 0 exited with a non-zero status of 127".

The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes

Anyone have an idea?

There is 1 best solution below


In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile.

Use ENTRYPOINT instead, as in this document: Train an ML model with custom containers

#CMD ["python", "trainer/mnist.py"]
# failed -> the replica master 0 exited with a non-zero status of 127

# Try ENTRYPOINT!
ENTRYPOINT ["python", "trainer/mnist.py"]

This may not be the cause in your case, though. It is worth checking whether your Dockerfile is the problem, for example by comparing it with the sample Dockerfile in the document linked above.
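As for the number itself: 127 is the status a POSIX shell conventionally returns when it cannot find the command it was asked to run, which would be consistent with the container failing to locate its start command. I can't say for sure that this is what AI Platform does internally, so treat it as a hint rather than a diagnosis. A minimal sketch to see the convention locally, assuming a POSIX `sh` is available (the command name and the `exit_status` helper are made up for illustration):

```python
import subprocess

# Hypothetical command name, chosen so that it should not exist on any PATH.
MISSING = "definitely-not-a-real-command-xyz"


def exit_status(command: str) -> int:
    """Run `command` through `sh -c` and return its exit status."""
    result = subprocess.run(["sh", "-c", command], stderr=subprocess.DEVNULL)
    return result.returncode


# A command the shell cannot find yields 127; a command that succeeds yields 0.
missing_status = exit_status(MISSING)
ok_status = exit_status("true")
print(missing_status, ok_status)
```

If you see 127 from your own container, running the image locally with the same arguments the training job passes can show which command is not being found.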