I'm trying to run a distributed TF job using Cloud ML.
I tested the code locally (using the gcloud ml local train command). Here are some specifications:
n_epochs = 20
noofsamples = 55000
batch_size = 100
num_batches = noofsamples/batch_size = 550
With these settings, total training steps = n_epochs * num_batches = 11000, which is what I expect.
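For reference, here is the step arithmetic as a simplified Python sketch (not my actual trainer code, just the math behind the numbers above):

```python
# Simplified sketch of the expected step count on a single machine.
n_epochs = 20
noofsamples = 55000
batch_size = 100

num_batches = noofsamples // batch_size  # 55000 / 100 = 550 batches per epoch
total_steps = n_epochs * num_batches     # 20 * 550 = 11000 training steps
print(total_steps)                       # 11000
```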
But when I execute the same job on Cloud ML using 2 worker nodes and 1 parameter server, it seems like the full training run is done on each machine:
training steps = 3 machines * n_epochs * num_batches = 33000
This shouldn't be the case.
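My suspicion is that every training node just runs the full n_epochs loop independently instead of splitting the batches between workers. As a sanity check, I was going to print the task info that Cloud ML injects via the TF_CONFIG environment variable on each node (a rough sketch; TF_CONFIG is what Cloud ML sets per the docs, the interpretation at the end is my guess):

```python
import json
import os

# Cloud ML sets TF_CONFIG on every node in the cluster; the 'task' entry
# tells each replica whether it is the master, a worker, or a parameter
# server, and which index it has.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
print('task type:', task.get('type'), 'index:', task.get('index'))

# If every node of type 'master'/'worker' runs the same
# `for epoch in range(n_epochs)` loop over all 550 batches, the global
# step count multiplies by the number of training nodes, which would
# explain the 33000 steps I'm seeing.
```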
Has anyone else encountered this problem?
I would appreciate your help!
Thanks