For a loss function $f(\theta)$, the L2-regularized loss is given by:
$$L = f(\theta) + \frac{\lambda}{2}\lVert\theta\rVert^2$$
The gradient-descent parameter update is then:
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \alpha\lambda\theta$$
where $\alpha$ is the learning rate. For weight decay, on the other hand, the parameter update rule is
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \lambda'\theta$$
For plain SGD, L2 regularization and weight decay are identical (more precisely, when $\lambda = \frac{\lambda'}{\alpha}$, as explained in @tTs's answer [1]). However, for optimizers such as Adam (see [1] and [3] for more in-depth explanations), it is well known that L2 regularization is not equivalent to weight decay; the same holds for SGD with momentum.
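Spelling out the equivalence: substituting $\lambda' = \alpha\lambda$ into the weight-decay rule makes the two updates coincide,
$$\theta \leftarrow \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \lambda'\theta = \theta - \alpha \frac{\partial f(\theta)}{\partial \theta} - \alpha\lambda\theta.$$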
Currently, I am trying to implement L2 regularization for a vanilla neural network in PyTorch, experimenting with SGD (with and without momentum) and Adam. Instead of modifying the loss function at each iteration by looping over all the model parameters, I am trying to modify only the gradient update rule in the optimizer class so that it matches the second update rule above (similar to @Szymon Maszke's response in [4]). However, despite other questions such as [5] suggesting that the weight decay implementation for SGD/Adam is not the same as L2 regularization, it seems that the current PyTorch implementation is actually equivalent to L2 regularization (the relevant code snippets for SGD and Adam in PyTorch are posted below).
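To make it concrete, the gradient-level modification I have in mind is equivalent in effect to the following sketch (placeholder model, data, and hyperparameter values of my own, not the code from [4]):

import torch

# Sketch: add the L2 term wd * theta directly to the gradients after
# backward(), instead of adding (wd/2) * ||theta||^2 to the loss itself.
model = torch.nn.Linear(3, 1)                            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # no built-in weight_decay
wd = 0.01                                                # assumed regularization strength

x, y = torch.randn(8, 3), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(p, alpha=wd)                     # grad <- grad + wd * p

optimizer.step()
optimizer.zero_grad()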
So my question is simply to confirm my suspicion:
Is L2 regularization actually the same as weight decay in the current PyTorch implementation?
Code snippets:
PyTorch's Adam (line 314 here):
def _single_tensor_adam(params: List[Tensor],
                        grads: List[Tensor],
                        exp_avgs: List[Tensor],
                        exp_avg_sqs: List[Tensor],
                        max_exp_avg_sqs: List[Tensor],
                        state_steps: List[Tensor],
                        grad_scale: Optional[Tensor],
                        found_inf: Optional[Tensor],
                        *,
                        amsgrad: bool,
                        beta1: float,
                        beta2: float,
                        lr: float,
                        weight_decay: float,
                        eps: float,
                        maximize: bool,
                        capturable: bool,
                        differentiable: bool):

    assert grad_scale is None and found_inf is None

    for i, param in enumerate(params):

        grad = grads[i] if not maximize else -grads[i]
        exp_avg = exp_avgs[i]
        exp_avg_sq = exp_avg_sqs[i]
        step_t = state_steps[i]

        if capturable:
            assert param.is_cuda and step_t.is_cuda, "If capturable=True, params and state_steps must be CUDA tensors."

        # update step
        step_t += 1

        if weight_decay != 0:
            grad = grad.add(param, alpha=weight_decay)

        if torch.is_complex(param):
            grad = torch.view_as_real(grad)
            exp_avg = torch.view_as_real(exp_avg)
            exp_avg_sq = torch.view_as_real(exp_avg_sq)
            param = torch.view_as_real(param)

        # Decay the first and second moment running average coefficient
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
        ...
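In the snippet above, the weight_decay term is folded into grad before the first and second moment estimates are updated, which is exactly the L2 formulation. For contrast, a decoupled weight-decay step (AdamW-style) would shrink the parameter directly so that the penalty never enters exp_avg / exp_avg_sq; a rough sketch of what that step would look like (my own illustration, not the actual adamw.py code):

# Sketch of a *decoupled* weight-decay step (AdamW-style), for contrast:
if weight_decay != 0:
    param.mul_(1 - lr * weight_decay)   # decay the weights directly...
# ...and the moment updates then use the raw grad, i.e. without
# grad = grad.add(param, alpha=weight_decay)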
PyTorch's SGD (line 232 here):
def _single_tensor_sgd(params: List[Tensor],
                       d_p_list: List[Tensor],
                       momentum_buffer_list: List[Optional[Tensor]],
                       *,
                       weight_decay: float,
                       momentum: float,
                       lr: float,
                       dampening: float,
                       nesterov: bool,
                       maximize: bool,
                       has_sparse_grad: bool):

    for i, param in enumerate(params):
        d_p = d_p_list[i] if not maximize else -d_p_list[i]

        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

        if momentum != 0:
            buf = momentum_buffer_list[i]

            if buf is None:
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            else:
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)

            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            else:
                d_p = buf

        param.add_(d_p, alpha=-lr)
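For what it's worth, a quick way to check the SGD case numerically is the sketch below (my own toy example; the data, lr and wd values are arbitrary): training with the built-in weight_decay should match training with an explicit $\frac{\lambda}{2}\lVert\theta\rVert^2$ term added to the loss.

import torch

# Sketch: compare PyTorch's SGD weight_decay against an explicit L2 term
# in the loss (toy data; lr and wd values are arbitrary).
torch.manual_seed(0)
x, y = torch.randn(8, 3), torch.randn(8, 1)
lr, wd = 0.1, 0.01

m1 = torch.nn.Linear(3, 1)
m2 = torch.nn.Linear(3, 1)
m2.load_state_dict(m1.state_dict())              # identical initial weights

opt1 = torch.optim.SGD(m1.parameters(), lr=lr, weight_decay=wd)
opt2 = torch.optim.SGD(m2.parameters(), lr=lr)   # L2 goes into the loss instead

for _ in range(5):
    opt1.zero_grad()
    torch.nn.functional.mse_loss(m1(x), y).backward()
    opt1.step()

    opt2.zero_grad()
    loss = torch.nn.functional.mse_loss(m2(x), y)
    loss = loss + (wd / 2) * sum(p.pow(2).sum() for p in m2.parameters())
    loss.backward()
    opt2.step()

print(torch.allclose(m1.weight, m2.weight))      # expected: True if they coincide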
P.S. Seems like I can't embed images for the equations because I don't have enough reputation :/