I've successfully trained models using TensorFlow's Object Detection API, both locally on a GPU (using model_main.py) and on Google's ML Engine (both GPU and TPU). However, I can't get model_tpu_main.py to train a model when running on Google Cloud with a manually provisioned VM and TPU.
When I launch model_tpu_main.py with a command like
python -m object_detection.model_tpu_main --model_dir=gs://bucket/training --tpu_zone us-central1-b --pipeline_config_path=gs://bucket/training/pipeline.config --job-dir gs://bucket/training --tpu_name mytpu_name
it gets stuck on:
...
W1113 03:05:16.628712 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_mean] is not available in checkpoint
W1113 03:05:16.629062 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_variance] is not available in checkpoint
W1113 03:05:16.629330 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/weights] is not available in checkpoint
2018-11-13 03:06:08.618834: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
...
Looking at the TPU logs, pretty much all I get is:
...
Start master session b9186abfa4e15b1d with config: isolate_session_state: true A
Start master session 48b812f9ca0d3ebf with config: isolate_session_state: true A
Start master session 33048226cb131f4c with config: isolate_session_state: true A
Start master session cab95e277a429f9d with config: isolate_session_state: true A
Start master session 56b5d3296c9bfe15 with config: isolate_session_state: true A
Start master session 3fdac64b285c365d with config: isolate_session_state: true A
Start master session ec1fa14806ad9351 with config: isolate_session_state: true A
...
Any idea what I'm doing wrong?