For a loss function $f(\theta)$, the L2-regularized loss is given by:
$$L = f(\theta) + \frac{\lambda}{2}\lVert\theta\rVert^2$$
The gradient-descent parameter update is then:
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \alpha\lambda\theta$$
where $\alpha$ is the learning rate. For weight decay, on the other hand, the parameter update rule is
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \lambda'\theta$$
For plain SGD, L2 regularization and weight decay are identical (more precisely, when $\lambda = \frac{\lambda'}{\alpha}$, as explained in @tTs's answer [1]). However, for optimizers such as Adam (see [1] and [3] for more in-depth explanations), it is well known that L2 regularization is not equivalent to weight decay; the same holds for SGD with momentum.
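Spelling out the equivalence: substituting $\lambda' = \alpha\lambda$ into the weight-decay rule makes the two updates coincide,
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \lambda'\theta = \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \alpha\lambda\theta.$$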
Currently, I am trying to implement L2 regularization for a vanilla neural network in PyTorch, experimenting with SGD (with and without momentum) and Adam. Instead of modifying the loss function at each iteration by looping over all the model parameters, I am trying to modify only the gradient update rule in the optimizer class so that it matches the second update rule above (similar to @Szymon Maszke's response in [4]). However, despite other questions such as [5] suggesting that the weight decay implementation for SGD/Adam is not the same as L2 regularization, it seems that the current PyTorch implementation is actually equivalent to L2 regularization (the relevant code snippets for SGD and Adam in PyTorch are posted below).
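To make it concrete, the gradient-level modification I have in mind is equivalent in effect to the following sketch (placeholder model, data, and hyperparameter values of my own, not the code from [4]):

import torch

# Sketch: add the L2 term wd * theta directly to the gradients after
# backward(), instead of adding (wd/2) * ||theta||^2 to the loss itself.
model = torch.nn.Linear(3, 1)                            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # no built-in weight_decay
wd = 0.01                                                # assumed regularization strength

x, y = torch.randn(8, 3), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(p, alpha=wd)                     # grad <- grad + wd * p

optimizer.step()
optimizer.zero_grad()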
So my question is simply to confirm my suspicion:
Is L2 regularization actually the same as weight decay in the current PyTorch implementation?
Code snippets:
PyTorch's Adam (line 314 here):
def _single_tensor_adam(params: List[Tensor],
                        grads: List[Tensor],
                        exp_avgs: List[Tensor],
                        exp_avg_sqs: List[Tensor],
                        max_exp_avg_sqs: List[Tensor],
                        state_steps: List[Tensor],
                        grad_scale: Optional[Tensor],
                        found_inf: Optional[Tensor],
                        *,
                        amsgrad: bool,
                        beta1: float,
                        beta2: float,
                        lr: float,
                        weight_decay: float,
                        eps: float,
                        maximize: bool,
                        capturable: bool,
                        differentiable: bool):

    assert grad_scale is None and found_inf is None

    for i, param in enumerate(params):

        grad = grads[i] if not maximize else -grads[i]
        exp_avg = exp_avgs[i]
        exp_avg_sq = exp_avg_sqs[i]
        step_t = state_steps[i]

        if capturable:
            assert param.is_cuda and step_t.is_cuda, "If capturable=True, params and state_steps must be CUDA tensors."

        # update step
        step_t += 1

        if weight_decay != 0:
            grad = grad.add(param, alpha=weight_decay)

        if torch.is_complex(param):
            grad = torch.view_as_real(grad)
            exp_avg = torch.view_as_real(exp_avg)
            exp_avg_sq = torch.view_as_real(exp_avg_sq)
            param = torch.view_as_real(param)

        # Decay the first and second moment running average coefficient
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
        ...
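In the snippet above, the weight_decay term is folded into grad before the first and second moment estimates are updated, which is exactly the L2 formulation. For contrast, a decoupled weight-decay step (AdamW-style) would shrink the parameter directly so that the penalty never enters exp_avg / exp_avg_sq; a rough sketch of what that step would look like (my own illustration, not the actual adamw.py code):

# Sketch of a *decoupled* weight-decay step (AdamW-style), for contrast:
if weight_decay != 0:
    param.mul_(1 - lr * weight_decay)   # decay the weights directly...
# ...and the moment updates then use the raw grad, i.e. without
# grad = grad.add(param, alpha=weight_decay)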
PyTorch's SGD (line 232 here):
def _single_tensor_sgd(params: List[Tensor],
                       d_p_list: List[Tensor],
                       momentum_buffer_list: List[Optional[Tensor]],
                       *,
                       weight_decay: float,
                       momentum: float,
                       lr: float,
                       dampening: float,
                       nesterov: bool,
                       maximize: bool,
                       has_sparse_grad: bool):

    for i, param in enumerate(params):
        d_p = d_p_list[i] if not maximize else -d_p_list[i]

        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

        if momentum != 0:
            buf = momentum_buffer_list[i]

            if buf is None:
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            else:
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            else:
                d_p = buf

        param.add_(d_p, alpha=-lr)
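For what it's worth, a quick way to check the SGD case numerically is the sketch below (my own toy example; the data, lr and wd values are arbitrary): training with the built-in weight_decay should match training with an explicit $\frac{\lambda}{2}\lVert\theta\rVert^2$ term added to the loss.

import torch

# Sketch: compare PyTorch's SGD weight_decay against an explicit L2 term
# in the loss (toy data; lr and wd values are arbitrary).
torch.manual_seed(0)
x, y = torch.randn(8, 3), torch.randn(8, 1)
lr, wd = 0.1, 0.01

m1 = torch.nn.Linear(3, 1)
m2 = torch.nn.Linear(3, 1)
m2.load_state_dict(m1.state_dict())              # identical initial weights

opt1 = torch.optim.SGD(m1.parameters(), lr=lr, weight_decay=wd)
opt2 = torch.optim.SGD(m2.parameters(), lr=lr)   # L2 goes into the loss instead

for _ in range(5):
    opt1.zero_grad()
    torch.nn.functional.mse_loss(m1(x), y).backward()
    opt1.step()

    opt2.zero_grad()
    loss = torch.nn.functional.mse_loss(m2(x), y)
    loss = loss + (wd / 2) * sum(p.pow(2).sum() for p in m2.parameters())
    loss.backward()
    opt2.step()

print(torch.allclose(m1.weight, m2.weight))      # expected: True if they coincide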
P.S. Seems like I can't embed images for the equations because I don't have enough reputation :/