Google AI Platform: The replica master 0 exited with a non-zero status of 127

2.3k Views Asked by foobar At 28 September 2020 at 17:20

But here, I'm encountering error "127" instead. Similar to that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I get the error message "The replica master 0 exited with a non-zero status of 127".

The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes

Anyone have an idea?


There is 1 best solution below

masaya On 17 November 2022 at 11:02

In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile.

Use ENTRYPOINT instead, as in this document: Train an ML model with custom containers

#CMD ["python", "trainer/mnist.py"]
# failed -> the replica master 0 exited with a non-zero status of 127

# Try ENTRYPOINT!
ENTRYPOINT ["python", "trainer/mnist.py"]
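Why the switch can matter: with Docker's exec-form instructions, arguments supplied when the container is started replace CMD but are appended to ENTRYPOINT. Here is a minimal Python sketch of that rule. It assumes the training job passes arguments to the container; `effective_argv` and the `--epochs=5` flag are hypothetical names used only for illustration, not part of any Google or Docker API.

```python
from typing import List, Optional


def effective_argv(
    entrypoint: Optional[List[str]],
    cmd: Optional[List[str]],
    runtime_args: List[str],
) -> List[str]:
    """Model Docker's exec-form rule for the command a container runs.

    Runtime arguments replace CMD entirely; whatever remains is appended
    to ENTRYPOINT. This is a simplified model, not Docker's own code.
    """
    tail = list(runtime_args) if runtime_args else list(cmd or [])
    return list(entrypoint or []) + tail


# Hypothetical argument passed by the training job (an assumption).
job_args = ["--epochs=5"]

# Dockerfile with CMD only: the job's arguments replace the whole command.
with_cmd = effective_argv(None, ["python", "trainer/mnist.py"], job_args)

# Dockerfile with ENTRYPOINT: the job's arguments are appended.
with_entrypoint = effective_argv(["python", "trainer/mnist.py"], None, job_args)

print("CMD only:  ", with_cmd)
print("ENTRYPOINT:", with_entrypoint)
```

In the CMD-only case the first job argument ends up in the executable position, so the container may try to run something that is not a program at all. That would be consistent with a "command not found" failure, although it may not be the only way to get status 127.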

This may not be the cause in your case, though. It is worth checking whether the Dockerfile is the problem: compare the sample Dockerfile in the document linked above with your own.
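As for what "127" usually means: in POSIX shells, exit status 127 is the conventional result when a command cannot be found. The sketch below reproduces that locally. It assumes a POSIX `sh` is available on the machine; `exit_status` and the command name `definitely-not-a-real-command-12345` are made up for the demonstration.

```python
import subprocess


def exit_status(command: str) -> int:
    """Run a command line through `sh -c` and return its exit status."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# A command name that should not exist on any system (an assumption).
missing = exit_status("definitely-not-a-real-command-12345")

# `true` is a standard utility that always succeeds.
ok = exit_status("true")

print("missing command:", missing)
print("successful command:", ok)
```

If the image builds locally, starting it with the same arguments the training job uses and checking the container's exit status is one way to see whether the failure comes from the Dockerfile rather than from the training code.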
