Amazon SageMaker multi GPU: No objective found


I have a question about SageMaker multi-GPU training. I have a customer whose code runs fine on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), the job fails with the following error:

“Failure reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.”

Their code handles multi-GPU usage and currently works well on their own machine outside of AWS. Is there any documentation you can point me to that would help them address the problem? They use PyTorch for all of their model development.


There is 1 best solution below

juvchan On

It looks like they are running hyperparameter optimization (HPO) on SageMaker, and their code is not emitting a metric that HPO can find and tune against. The problem is most likely in how they specify the regular expression for the objective metric; for more details, see the SageMaker Estimator metric definitions documentation.

In short, use a tool like https://regex101.com to check that the regex they use actually extracts the objective value from their training logs.
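The same check can also be done locally in Python before launching a tuning job. The sketch below is only illustrative: the metric name `validation:accuracy`, the `val_acc=` log format, and the sample log lines are all made up, and the customer's real log lines and regex should be substituted. It assumes the metric definition has the usual name-plus-regex shape with a single capture group around the number.

```python
import re

# Hypothetical metric definition: a name plus a regex with one capture
# group around the number. Replace with the customer's real definition.
METRIC_DEFINITIONS = [
    {"Name": "validation:accuracy", "Regex": r"val_acc=([0-9\.]+)"},
]


def format_metric_line(epoch, val_acc):
    """Build the log line a training script might print (hypothetical format)."""
    return f"epoch={epoch} val_acc={val_acc:.4f}"


def extract_metric(regex, log_text):
    """Return every value the regex captures from the log text, as floats."""
    return [float(match) for match in re.findall(regex, log_text)]


# Sample log lines, invented for illustration. In practice, paste lines
# copied from the logs of the failing multi-GPU training job.
sample_logs = "\n".join([
    format_metric_line(1, 0.8123),
    "some unrelated log line",
    format_metric_line(2, 0.8547),
])

values = extract_metric(METRIC_DEFINITIONS[0]["Regex"], sample_logs)
if not values:
    print("Regex matched nothing: the tuning job would find no objective metric.")
else:
    print(f"Captured values: {values}")
```

If the regex matches the single-GPU logs but not the multi-GPU ones, compare the two sets of log lines directly; a difference in what the multi-GPU code path prints would explain why the objective metric is not found.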