I have created a custom Docker container using an Amazon TensorFlow container as a starting point:
763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
Inside the container I run a custom Keras (with TF backend) training job from the Docker SAGEMAKER_PROGRAM. I can access the training data fine (from an EFS mount) and can generate output into /opt/ml/model that gets synced back to S3. So input and output are good; what I am missing is real-time monitoring.
A SageMaker training job emits system metrics such as CPU and GPU load, which you can conveniently view in real time on the SageMaker training job console. But I cannot find a way to emit metrics about the progress of the training job (loss, accuracy, etc.) from my Python code.
Ideally I would like to use TensorBoard, but as SageMaker doesn't expose the instance on the EC2 console, I cannot see how to find the IP address of the instance to connect TensorBoard to.
So the fallback is to try to emit relevant metrics from the training code so that we can monitor the job as it runs.
The basic question is: how do I monitor key metrics in real time for my custom training job running in a container as a SageMaker training job?
- Is a TensorBoard solution possible? If so, how?
- If not, how do I emit metrics from my Python code and have them show up in the training job console or directly as CloudWatch metrics?
BTW: so far I have been unable to get sufficient credentials inside the training job container to access either S3 or CloudWatch.
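If you're using custom images for training, you can specify a name and a regular expression for each metric you want to track. SageMaker scans the job's log output (stdout/stderr) for matches and publishes the captured values to CloudWatch, where they also show up on the training job console, so your code only needs to print the values and does not need CloudWatch credentials inside the container.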
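As a rough sketch of how that could look from the Python side: the training code prints one line per epoch in a fixed format, and each metric definition pairs a name with a regex whose capture group picks out the number. The metric names, the `key=value` log format, and the helpers `format_metrics` and `log_epoch` are all assumptions made for illustration; the Keras log keys (`loss`, `acc`, `val_loss`) also vary between Keras versions, so check what your `logs` dict actually contains.

```python
import re

# Hypothetical metric definitions (names and regexes are illustrative).
# This is the Name/Regex shape used by MetricDefinitions in the
# CreateTrainingJob AlgorithmSpecification, and by the metric_definitions
# argument of the SageMaker Python SDK Estimator.
METRIC_DEFINITIONS = [
    {"Name": "train:loss", "Regex": r"train_loss=([0-9.eE+-]+)"},
    {"Name": "train:accuracy", "Regex": r"train_acc=([0-9.eE+-]+)"},
    {"Name": "validation:loss", "Regex": r"val_loss=([0-9.eE+-]+)"},
]

# Assumed mapping from Keras log keys to the labels printed in the log line.
# The key for accuracy may be "acc" or "accuracy" depending on the version.
LOG_KEYS = {"loss": "train_loss", "acc": "train_acc", "val_loss": "val_loss"}


def format_metrics(epoch, logs):
    """Build one 'key=value' log line from a Keras-style logs dict."""
    parts = ["epoch={}".format(epoch)]
    for key, label in LOG_KEYS.items():
        if key in logs:
            parts.append("{}={:.6f}".format(label, float(logs[key])))
    return " ".join(parts)


def log_epoch(epoch, logs=None):
    """Print the metrics line to stdout so it lands in the job's log stream."""
    print(format_metrics(epoch, logs or {}), flush=True)


# Possible wiring into Keras (not executed here, to keep the sketch
# free of a TensorFlow dependency):
#   from keras.callbacks import LambdaCallback
#   model.fit(..., callbacks=[LambdaCallback(on_epoch_end=log_epoch)])
```

Checking the regexes locally against a sample line before submitting a job is cheap and catches the most common failure, a pattern that never matches. Using distinct prefixes (`train_loss=` versus `val_loss=`) keeps one pattern from accidentally matching another metric's value.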