Distributed TensorFlow Using Cloud ML - Training


I'm trying to run a distributed TF job using Cloud ML.

I tested the code locally (using the gcloud ml local command). Here are the relevant specifications:

n_epochs = 20
noofsamples = 55000
batch_size = 100
num_batches = noofsamples/batch_size = 550

With these specifications, training steps = n_epochs * num_batches = 11000, which is correct.
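
As a sanity check, the local step count matches this simple calculation (plain arithmetic, not my actual training loop):

    # Sanity check of the expected local step count (arithmetic only, not training code)
    n_epochs = 20
    noofsamples = 55000
    batch_size = 100

    num_batches = noofsamples // batch_size   # 550 batches per epoch
    training_steps = n_epochs * num_batches   # 20 * 550 = 11000 steps in total
    print(num_batches, training_steps)        # prints: 550 11000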

But if I execute the same job on Cloud ML using 2 worker nodes and 1 parameter server, it seems like the full training run is repeated on each machine:

training steps = 3 machines * n_epochs * num_batches = 33000

This shouldn't be the case: I expected the batches to be split across the training replicas so that the total stays at 11000 steps.
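
For reference, what I expected is that each training replica would only process its own slice of the data, so the per-replica step count shrinks accordingly. A minimal sketch of that idea, assuming a tf.data input pipeline and the TF_CONFIG environment variable that Cloud ML sets (the file name is just a placeholder, not my actual code):

    import json
    import os

    import tensorflow as tf

    # Cloud ML exposes the cluster layout and this replica's role via TF_CONFIG.
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    cluster = tf_config.get('cluster', {})
    task = tf_config.get('task', {})
    task_type = task.get('type', 'master')    # 'master', 'worker', or 'ps'
    task_index = task.get('index', 0)

    # Training replicas = master + workers; the parameter server runs no training steps.
    num_training_replicas = len(cluster.get('master', [])) + len(cluster.get('worker', []))
    num_training_replicas = max(num_training_replicas, 1)    # local runs have no cluster
    # Give every training replica a distinct shard id: master -> 0, worker i -> i + 1.
    shard_index = task_index if task_type == 'master' else task_index + 1

    # 'train.tfrecords' is a hypothetical input file used only for illustration.
    dataset = tf.data.TFRecordDataset('train.tfrecords')
    dataset = dataset.shard(num_training_replicas, shard_index)   # 1/N of the data per replica
    dataset = dataset.batch(100).repeat(20)                       # batch_size = 100, n_epochs = 20

With input sharded like this, each of the 3 training replicas would see roughly 55000 / 3 samples per epoch, so the combined step count across all replicas would stay near 11000 instead of tripling.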

Did any of you encounter this problem?

I would appreciate your help!

Thanks
