I built a neural network and successfully trained it using backpropagation with stochastic gradient descent. Now I'm switching to batch training, but I'm a bit confused about when to apply momentum and weight decay. I know fairly well how backpropagation works in theory; I'm just stuck on implementation details. With the stochastic approach, all I had to do was apply the updates to the weights immediately after computing the gradients, as in this pseudo-Python code:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.update_weights()
where the update_weights method is defined as follows:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.weights[h.index][o.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient[h.index][o.index]
            self.weights[h.index][o.index] -= self.decay * self.weights[h.index][o.index]
            self.prev_gradient[h.index][o.index] = gradient
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            self.weights[i.index][h.index] += self.learning_rate * gradient + \
                self.momentum * self.prev_gradient[i.index][h.index]
            self.weights[i.index][h.index] -= self.decay * self.weights[i.index][h.index]
            self.prev_gradient[i.index][h.index] = gradient
This works like a charm (note that there might be errors; I'm using Python only because it's more understandable. The actual net is coded in C, and this code just shows the steps I take to compute the updates). Now, switching to batch updates, the main algorithm should be something like:
for epoch in epochs:
    for p in patterns:
        outputs = net.feedforward(p.inputs)
        # output_layer_errors is needed to plot the error
        output_layer_errors = net.backpropagate(outputs, p.targets)
        net.accumulate_updates()
    net.update_weights()
where the accumulate method is as follows:
def accumulate_updates(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            gradient = h.output * o.error
            self.accumulator[h.index][o.index] += self.learning_rate * gradient
            # should I compute momentum here?
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            gradient = i.output * h.error
            # should I just accumulate the gradient without scaling it by the learning rate here?
            self.accumulator[i.index][h.index] += self.learning_rate * gradient
            # should I compute momentum here?
and update_weights looks like this:
def update_weights(self):
    for h in self.hidden_neurons:
        for o in self.output_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[h.index][o.index] += self.accumulator[h.index][o.index]
            self.accumulator[h.index][o.index] = 0.0
    for i in self.input_neurons:
        for h in self.hidden_neurons:
            # what to do here? apply momentum? apply weight decay?
            self.weights[i.index][h.index] += self.accumulator[i.index][h.index]
            self.accumulator[i.index][h.index] = 0.0
I'm not sure whether I have to:
1) scale the gradient by the learning rate at accumulation time or at update time
2) apply momentum at accumulation time or at update time
3) same as 2), but for weight decay
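To make the three options concrete, here is a toy single-weight sketch of the arrangement I currently suspect is right (accumulate the raw gradients, then apply the learning rate, momentum, and weight decay once at update time). All the names and numbers here are made up for illustration; I'm not claiming this is the correct arrangement, it's just the one I'm asking about:

```python
# Hypothetical single-weight example of: accumulate raw gradients,
# apply learning rate / momentum / decay only once, at update time.

def batch_update(weight, gradients, lr, momentum, decay, prev_delta):
    """Apply one batch update to a single weight.

    gradients: per-pattern raw gradients collected during the epoch.
    prev_delta: the delta applied at the previous update (for momentum).
    """
    # Average the raw gradients over the batch (summing instead would
    # just rescale the effective learning rate).
    g = sum(gradients) / len(gradients)
    # Learning rate and momentum are applied once per batch.
    delta = lr * g + momentum * prev_delta
    weight += delta
    # Weight decay shrinks the weight itself, also once per batch.
    weight -= decay * weight
    return weight, delta

w, prev = 0.5, 0.0
for epoch in range(3):
    grads = [0.1, 0.2, 0.3]  # pretend per-pattern raw gradients
    w, prev = batch_update(w, grads, lr=0.1, momentum=0.9,
                           decay=0.01, prev_delta=prev)
```

The sign convention matches my code above (`+=` on the weight, with the error defined as target minus output).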
Can somebody help me sort this out? Sorry for the long question, but I thought being detailed would explain my doubts better.
Just a quick comment on this. Stochastic gradient descent most of the time leads to a non-smooth optimization, and it requires sequential processing that does not suit current technology advances such as parallel computation.
As such, the mini-batch approach tries to combine the advantages of stochastic optimization with those of batch optimization (parallel computation). The idea is to create small training blocks which you hand to the learning algorithm in parallel. At the end, each worker reports the error on its training block, which you can average and use as in normal stochastic gradient descent.
This approach leads to a much smoother optimization, and probably to a quicker one if you make use of parallel computing.
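As a rough sketch of this scheme, here is a toy one-weight "network" (all names and numbers below are made up for illustration; in a real setup each block's gradients would be computed by a separate worker):

```python
# Mini-batch sketch: split the patterns into small blocks, compute the
# per-pattern gradients for each block (this part is parallelizable),
# average them, and update as in plain stochastic gradient descent.

def pattern_gradient(w, x, target):
    # toy "network": a single weight with identity activation
    output = w * x
    error = target - output
    return x * error  # raw gradient for this pattern

def minibatch_sgd(w, patterns, block_size, lr, epochs):
    for _ in range(epochs):
        for start in range(0, len(patterns), block_size):
            block = patterns[start:start + block_size]
            # each worker would compute the gradients for its own block
            grads = [pattern_gradient(w, x, t) for x, t in block]
            # average the block's gradients and update as in plain SGD
            w += lr * sum(grads) / len(grads)
    return w

# patterns follow target = 2 * x, so w should approach 2.0
patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = minibatch_sgd(0.0, patterns, block_size=2, lr=0.05, epochs=50)
```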