I am confused about the minimize function. E.g., for a distance variable X with shape [mini_batch_size, 1], define
loss_1 = tf.reduce_mean(X),
loss_2 = X.
Then minimize(loss_1) is mini-batch gradient descent, but what about minimize(loss_2)? Element-wise updating? If so, is it exactly the same as stochastic gradient descent?
Actually, this is a very technical detail of TF. loss_2 is equivalent to loss_1 up to multiplication by a constant. It is not "SGD" as other answers suggest; that is not how TF works. It is also a mini-batch update, and the only difference from loss_1 is that the gradient is multiplied by batch_size, that's it.
The crucial element is hidden in the way tf.gradients is implemented. Namely, it expects a scalar function to be passed as the first argument. However, if you pass a tensor with multiple values, it does not throw an error; instead it simply sums them. You can find this in the official TF documentation of tf.gradients:
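> Constructs symbolic derivatives of sum of ys w.r.t. x in xs.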
So in fact your loss_2 is equivalent to:
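```python
tf.reduce_sum(X)   # i.e. loss_1 * mini_batch_size, no division by batch_size
```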
and obviously the only difference from loss_1 is that it is not divided by batch_size. Nothing else.
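To see this numerically, here is a minimal sketch with a toy batch of size 3 (the variable w, the concrete values, and the name loss_3 are just for illustration):

```python
import tensorflow as tf

# toy "batch" of size 3, so batch_size = 3
x = tf.constant([[1.0], [2.0], [3.0]])
w = tf.Variable(1.0)

X = w * x                    # shape [3, 1], plays the role of the per-sample distance

loss_1 = tf.reduce_mean(X)   # scalar mini-batch loss
loss_2 = X                   # non-scalar "loss"
loss_3 = tf.reduce_sum(X)    # what tf.gradients implicitly reduces loss_2 to

g1 = tf.gradients(loss_1, w)
g  = tf.gradients(loss_2, w)
g2 = tf.gradients(loss_3, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g1, g, g2]))
```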
Prints:
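```
[[2.0], [6.0], [6.0]]
```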
And, as expected, g and g2 are the same, while g1 is just g (or g2) divided by 3 (the batch_size).