Cloud ML Engine distributed training default type for custom tf.estimator


This article suggests there are three options for distributed training:

  1. Data-parallel training with synchronous updates.
  2. Data-parallel training with asynchronous updates.
  3. Model-parallel training.

The tutorial then goes on to suggest that the code that follows performs data-parallel training with asynchronous updates on Cloud ML Engine, which behaves as follows: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches."

However, it's not clear which portion of the code actually specifies that this is data-parallel training with asynchronous updates. Is this simply the default for Cloud ML Engine if you run it in distributed training mode with a custom tf.estimator?
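For reference, Cloud ML Engine describes the cluster to each replica through the TF_CONFIG environment variable rather than through anything in the model code itself. A minimal sketch of inspecting it (the host addresses in the comment are illustrative):

```python
import json
import os

# Cloud ML Engine sets TF_CONFIG on every replica; it names the cluster
# members and this replica's own role, e.g.:
# {"cluster": {"master": ["host0:2222"],
#              "worker": ["host1:2222", "host2:2222"],
#              "ps":     ["host3:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
cluster_spec = tf_config.get('cluster', {})
task = tf_config.get('task', {})
print('Running as %s %s' % (task.get('type'), task.get('index')))
```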


2 Answers

BEST ANSWER

The short answer is that tf.estimator is currently mostly built around data-parallel training with asynchronous updates (option 2).
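As a minimal sketch (the model and names here are illustrative, not from the linked tutorial), a custom Estimator's model_fn just builds a train_op; when the job runs distributed with between-graph replication, each worker computes gradients on its own batches and applies them to the shared parameters without waiting for the others, which is the asynchronous behavior the question describes:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # A toy model; the layer sizes here are illustrative.
    hidden = tf.layers.dense(features['x'], 128, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    # Nothing below requests asynchrony explicitly: in a distributed run,
    # each worker's minimize() applies its gradients independently, so
    # updates are asynchronous by default.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)
```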

You get model-parallel training (option 3) simply by using with tf.device() statements in your code.
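For example, a sketch of placing different parts of the graph on different devices (the device names and the split are illustrative):

```python
import tensorflow as tf

def model_parallel_layers(features):
    # First half of the model on one device...
    with tf.device('/device:GPU:0'):
        hidden = tf.layers.dense(features, 512, activation=tf.nn.relu)
    # ...second half on another; activations flow between the devices.
    with tf.device('/device:GPU:1'):
        logits = tf.layers.dense(hidden, 10)
    return logits
```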

You could try using tf.train.SyncReplicasOptimizer and probably accomplish synchronous training (option 1).
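A sketch of what that might look like inside a model_fn, assuming the cluster's worker count and chief flag are passed in via params (those names and the hyperparameters are illustrative); note that SyncReplicasOptimizer wraps an existing optimizer and needs its session hook attached so the workers coordinate:

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Wrap the base optimizer; gradients from all workers are
    # aggregated before a single synchronous update is applied.
    optimizer = tf.train.SyncReplicasOptimizer(
        tf.train.AdamOptimizer(learning_rate=0.001),
        replicas_to_aggregate=params['num_workers'],
        total_num_replicas=params['num_workers'])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    # The optimizer's hook synchronizes the workers' sessions.
    sync_hook = optimizer.make_session_run_hook(is_chief=params['is_chief'])
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, training_hooks=[sync_hook])
```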

All of the above applies generally to tf.estimator; nothing is different for CloudML Engine.

ANSWER

Cloud ML Engine doesn't determine the mode of distributed training; that depends on how the user sets up training using the TensorFlow libraries. In the MNIST example linked from the article, the code uses the TF Learn classes; specifically, an Estimator is constructed in model.py.

That code selects the optimizer, which in this case is the AdamOptimizer; it applies updates asynchronously. If you wanted to do synchronous updates, you'd have to wrap the optimizer in SyncReplicasOptimizer.
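In other words, the change in model.py would come down to wrapping the optimizer it already constructs, along these lines (the learning rate and num_workers are illustrative, not taken from the example):

```python
# Asynchronous (what the example effectively does today):
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

# Synchronous: wrap the same optimizer so gradients are
# aggregated across workers before each update.
optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(learning_rate=0.001),
    replicas_to_aggregate=num_workers)
```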

For more information on how to set up synchronous training you can refer to this doc.