AWS Sagemaker custom training job container emit loss metric

Question

1.1k Views Asked by jufl At 29 July 2025 at 06:34

I have created a custom Docker container using an Amazon TensorFlow container as a starting point:

763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04

Inside the container I run a custom Keras (with TF backend) training job from the Docker SAGEMAKER_PROGRAM. I can access the training data (from an EFS mount) and can write output to /opt/ml/model, which gets synced back to S3. So input and output work; what I am missing is real-time monitoring.

A SageMaker training job emits system metrics such as CPU and GPU load, which you can conveniently view in real time on the SageMaker training job console. But I cannot find a way to emit metrics about the progress of the training job (loss, accuracy, etc.) from my Python code.

Ideally I would like to use TensorBoard, but since SageMaker doesn't expose the instance on the EC2 console, I cannot see how to find the IP address of the instance to connect to.

So the fallback is to emit relevant metrics from the training code so that we can monitor the job as it runs.

The basic question is: how do I monitor key metrics in real time for my custom training job running in a container on SageMaker?
- Is a TensorBoard solution possible? If so, how?
- If not, how do I emit metrics from my Python code and have them show up in the training job console or directly as CloudWatch metrics?

BTW: so far I have not been able to get sufficient credentials inside the training job container to access either S3 or CloudWatch.

Original Q&A

**Sifei** · Answer 1

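If you're using a custom image for training, you can specify a name and a regular expression for each metric you want to track. SageMaker scans the job's log output (stdout and stderr) for lines matching each regex and publishes the captured value as a CloudWatch metric, which also shows up on the training job console.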

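from sagemaker.estimator import Estimator

# Parameter names follow SageMaker Python SDK v1; v2 renames them to
# image_uri, instance_count and instance_type.
byo_estimator = Estimator(image_name=image_name,
                      role='SageMakerRole', train_instance_count=1,
                      train_instance_type='ml.c4.xlarge',
                      sagemaker_session=sagemaker_session,
                      # Raw strings keep \S intact; each regex needs one capture group for the value.
                      metric_definitions=[{'Name': 'test:msd', 'Regex': r'#quality_metric: host=\S+, test msd <loss>=(\S+)'},
                                          {'Name': 'test:ssd', 'Regex': r'#quality_metric: host=\S+, test ssd <loss>=(\S+)'}])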
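
The metric definitions above only work if the training script prints lines that the regexes can match. A minimal sketch of that side, assuming the '#quality_metric' format from the regexes above; the helper names `format_metric_line` and `emit_metric` are hypothetical and not part of any SageMaker API:

```python
import re
import socket
import sys

# Regex copied from the 'test:msd' metric definition above.
MSD_REGEX = r'#quality_metric: host=\S+, test msd <loss>=(\S+)'


def format_metric_line(metric, value, host=None):
    """Build a log line in the '#quality_metric' format the regexes expect."""
    host = host or socket.gethostname()
    return '#quality_metric: host={}, test {} <loss>={}'.format(host, metric, value)


def emit_metric(metric, value):
    """Print one metric line and flush so it reaches the job logs promptly."""
    print(format_metric_line(metric, value))
    sys.stdout.flush()


if __name__ == '__main__':
    # Check locally that the line matches before launching a training job.
    line = format_metric_line('msd', 0.0421, host='algo-1')
    match = re.search(MSD_REGEX, line)
    print(match.group(1))  # the captured metric value, as a string
```

In a Keras job, one option would be to call something like `emit_metric` from a callback's `on_epoch_end` with the values in `logs`. Testing the regex locally first is worthwhile, since a line that does not match is simply not recorded as a metric.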
