There's a similar SO question: Tensorflow on ML Engine: The replica master 0 exited with a non-zero status of 1
But here, I'm encountering error "127" instead. As in that question, I launched a PyTorch custom training container on AI Platform (previously ML Engine), and after about 2 minutes I get the error message "The replica master 0 exited with a non-zero status of 127".
The documentation here doesn't quite say what "127" means: https://cloud.google.com/ai-platform/training/docs/troubleshooting#understanding_training_application_return_codes
Anyone have an idea?
In my case, the problem was that I was using CMD instead of ENTRYPOINT in the Dockerfile. Use ENTRYPOINT, as in this document: Train an ML model with custom containers.
This may not be the cause in your case, though. It may be a good idea to check whether the Dockerfile is at fault by comparing the sample Dockerfile in the above link with your own.
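For reference, here is a minimal sketch of what the ENTRYPOINT version might look like. The base image, the trainer/ directory, and the trainer.task module name are hypothetical placeholders, not taken from the linked document. A likely explanation for the 127 is that Docker replaces CMD with any arguments supplied at run time, while it appends them to an exec-form ENTRYPOINT. If the training service passes your job arguments that way, a CMD-based image ends up trying to execute the first argument as a program, and 127 is the conventional exit status for "command not found".

```dockerfile
# Hypothetical base image and paths, for illustration only.
FROM python:3.9-slim

WORKDIR /app

# Assumes the training code lives in a local trainer/ package.
COPY trainer/ /app/trainer/
RUN pip install --no-cache-dir torch

# Exec-form ENTRYPOINT: arguments supplied at run time are appended
# to this command instead of replacing it.
ENTRYPOINT ["python", "-m", "trainer.task"]

# With CMD instead, run-time arguments would replace the whole command:
# CMD ["python", "-m", "trainer.task"]
```

You can probably reproduce the difference locally by building the image and running it with an extra argument, for example docker run IMAGE --epochs=1 (a made-up flag): with ENTRYPOINT the flag reaches the Python script, while with CMD Docker tries to run --epochs=1 itself and fails.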