I built a neural network and successfully trained it using backpropagation with stochastic gradient descent. Now I'm switching to batch training, but I'm a bit confused about when to apply momentum and weight decay. I know fairly well how backpropagation works in theory; I'm just stuck on implementation details. With the stochastic approach, all I had to do was apply the updates to the weights immediately after computing the gradients, as in this pseudo-Python code:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.update_weights()
where the update_weights method is defined as follows:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.weights[h.index][o.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient[h.index][o.index]
            self.weights[h.index][o.index] -= self.decay * self.weights[h.index][o.index]
            self.prev_gradient[h.index][o.index] = gradient
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            self.weights[i.index][h.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient[i.index][h.index]
            self.weights[i.index][h.index] -= self.decay * self.weights[i.index][h.index]
            self.prev_gradient[i.index][h.index] = gradient
This works like a charm (note that there might be errors; I'm using Python only because it's more understandable. The actual net is coded in C, and this code just shows the steps I take to compute the updates). Now, switching to batch updates, the main algorithm should be something like:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.accumulate_updates()
    net.update_weights()
where the accumulate method is as follows:
def accumulate_updates(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.accumulator[h.index][o.index] += self.learning_rate * gradient
            # should I compute momentum here?
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            # should I just accumulate the gradient without scaling it by the learning rate here?
            self.accumulator[i.index][h.index] += self.learning_rate * gradient
            # should I compute momentum here?
and update_weights looks like this:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[h.index][o.index] += self.accumulator[h.index][o.index]
            self.accumulator[h.index][o.index] = 0.0
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[i.index][h.index] += self.accumulator[i.index][h.index]
            self.accumulator[i.index][h.index] = 0.0
I'm not sure whether I have to:
1) scale the gradient by the learning rate at accumulation time or at update time
2) apply momentum at accumulation time or at update time
3) same as 2), but for weight decay
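To make the three options concrete, here is a toy single-weight sketch of the arrangement I currently suspect is right (accumulate the raw gradients, then apply the learning rate, momentum, and weight decay once at update time). All the names and numbers here are made up for illustration; I'm not claiming this is the correct arrangement, it's just the one I'm asking about:

```python
# Hypothetical single-weight example of: accumulate raw gradients,
# apply learning rate / momentum / decay only once, at update time.

def batch_update(weight, gradients, lr, momentum, decay, prev_delta):
    """Apply one batch update to a single weight.

    gradients: per-pattern raw gradients collected during the epoch.
    prev_delta: the delta applied at the previous update (for momentum).
    """
    # Average the raw gradients over the batch (summing instead would
    # just rescale the effective learning rate).
    g = sum(gradients) / len(gradients)
    # Learning rate and momentum are applied once per batch.
    delta = lr * g + momentum * prev_delta
    weight += delta
    # Weight decay shrinks the weight itself, also once per batch.
    weight -= decay * weight
    return weight, delta

w, prev = 0.5, 0.0
for epoch in range(3):
    grads = [0.1, 0.2, 0.3]  # pretend per-pattern raw gradients
    w, prev = batch_update(w, grads, lr=0.1, momentum=0.9,
                           decay=0.01, prev_delta=prev)
```

The sign convention matches my code above (`+=` on the weight, with the error defined as target minus output).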
Can somebody help me sort this out? Sorry for the long question, but I thought being detailed would explain my doubts better.
Just a quick comment on this. Stochastic gradient descent most of the time leads to a non-smooth optimization, and it requires sequential processing that does not suit current technology advances such as parallel computation.
As such, the mini-batch approach tries to combine the advantages of stochastic optimization with those of batch optimization (parallel computation). The idea is to create small training blocks which you hand to the learning algorithm in parallel. At the end, each worker reports the error on its training block, which you can average and use as in normal stochastic gradient descent.
This approach leads to a much smoother optimization, and probably to a quicker one if you make use of parallel computing.
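As a rough sketch of this scheme, here is a toy one-weight "network" (all names and numbers below are made up for illustration; in a real setup each block's gradients would be computed by a separate worker):

```python
# Mini-batch sketch: split the patterns into small blocks, compute the
# per-pattern gradients for each block (this part is parallelizable),
# average them, and update as in plain stochastic gradient descent.

def pattern_gradient(w, x, target):
    # toy "network": a single weight with identity activation
    output = w * x
    error = target - output
    return x * error  # raw gradient for this pattern

def minibatch_sgd(w, patterns, block_size, lr, epochs):
    for _ in range(epochs):
        for start in range(0, len(patterns), block_size):
            block = patterns[start:start + block_size]
            # each worker would compute the gradients for its own block
            grads = [pattern_gradient(w, x, t) for x, t in block]
            # average the block's gradients and update as in plain SGD
            w += lr * sum(grads) / len(grads)
    return w

# patterns follow target = 2 * x, so w should approach 2.0
patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = minibatch_sgd(0.0, patterns, block_size=2, lr=0.05, epochs=50)
```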