How to compute Hessian of the loss w.r.t. the parameters in PyTorch using autograd.grad


I know there is quite a bit of content out there about "computing the Hessian" in PyTorch, but as far as I can tell I haven't found anything that works for me. To be precise, the Hessian I want is the Jacobian of the gradient of the loss with respect to the network parameters, i.e. the matrix of second-order derivatives with respect to the parameters.

I found some code that works in an intuitive way, although it shouldn't be fast. It simply computes the gradient of the gradient of the loss w.r.t. the params, again w.r.t. the params, one element (of the gradient) at a time. I think the logic is right, but I am getting an error having to do with requires_grad. I'm a PyTorch beginner, so maybe it's a simple thing, but the error seems to say that it can't take the gradient of the env_grads variable, which is the output from the previous grad call.

Any help with this would be greatly appreciated. Here is the code followed by the error message. I also printed out the env_grads[0] variable so we can see that it is in fact a tensor, which is the correct output from the previous grad call.

env_loss = loss_fn(env_outputs, env_targets)
total_loss += env_loss
env_grads = torch.autograd.grad(env_loss, params,retain_graph=True)

print(env_grads[0])
hess_params = torch.zeros_like(env_grads[0])
for i in range(env_grads[0].size(0)):
    for j in range(env_grads[0].size(1)):
        hess_params[i, j] = torch.autograd.grad(env_grads[0][i][j], params, retain_graph=True)[0][i, j] #  <--- error here
print(hess_params)

Output:

tensor([[-6.4064e-03, -3.1738e-03,  1.7128e-02,  8.0391e-03],
        [ 7.1698e-03, -2.4640e-03, -2.2769e-03, -1.0687e-03],
        [-3.0390e-04, -2.4273e-03, -4.0799e-02, -1.9149e-02],
        ...,
        [ 1.1258e-02, -2.5911e-05, -9.8133e-02, -4.6059e-02],
        [ 8.1502e-04, -2.5814e-03,  4.1772e-02,  1.9606e-02],
        [-1.0075e-02,  6.6072e-03,  8.3118e-04,  3.9011e-04]], device='cuda:0')

Error:

Traceback (most recent call last):
  File "/home/jefferythewind/anaconda3/envs/rapids3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jefferythewind/anaconda3/envs/rapids3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jefferythewind/Projects/Irina/learning-explanations-hard-to-vary/and_mask/run_synthetic.py", line 258, in <module>
    main(args)
  File "/home/jefferythewind/Projects/Irina/learning-explanations-hard-to-vary/and_mask/run_synthetic.py", line 245, in main
    deep_mask=args.deep_mask
  File "/home/jefferythewind/Projects/Irina/learning-explanations-hard-to-vary/and_mask/run_synthetic.py", line 103, in train
    scale_grad_inverse_sparsity=scale_grad_inverse_sparsity
  File "/home/jefferythewind/Projects/Irina/learning-explanations-hard-to-vary/and_mask/and_mask_utils.py", line 154, in get_grads_deep
    hess_params[i, j] = torch.autograd.grad(env_grads[0][i][j], params, retain_graph=True)[0][i, j]
  File "/home/jefferythewind/anaconda3/envs/rapids3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
    inputs, allow_unused)
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

There are 2 best solutions below


Don't mind if I answer my own question here. I found the hint I needed in this thread from the PyTorch site, about halfway down:

"That won’t work and would fit the 3rd point I mentioned. You would need to create the computation graph with differentiable operations, which will create a result tensor with a valid grad_fn."

I noticed there is an argument for autograd.grad called create_graph, so I set this to True in the first call to grad, and that ended up solving the error.

Modified working code:

env_loss = loss_fn(env_outputs, env_targets)
total_loss += env_loss
env_grads = torch.autograd.grad(env_loss, params, retain_graph=True, create_graph=True)

print(env_grads[0])
hess_params = torch.zeros_like(env_grads[0])
for i in range(env_grads[0].size(0)):
    for j in range(env_grads[0].size(1)):
        hess_params[i, j] = torch.autograd.grad(env_grads[0][i][j], params, retain_graph=True)[0][i, j]  # no longer errors with create_graph=True above
print(hess_params)
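
As a side note (not in the original code), the double loop above only pulls out the (i, j) entry of each inner grad call, so it collects "diagonal" second derivatives of the first parameter tensor rather than a full Hessian block. Below is a hedged sketch of how the same nested-grad idea could produce the full Hessian of a loss with respect to a single parameter tensor; the toy param, x and loss are made up for illustration.

import torch

# hypothetical toy setup: one weight matrix and a scalar loss
param = torch.randn(3, 4, requires_grad=True)
x = torch.randn(5, 4)
loss = (x @ param.t()).pow(2).sum()

# first-order gradient, kept in the graph so it can be differentiated again
grad_flat = torch.autograd.grad(loss, param, create_graph=True)[0].reshape(-1)

rows = []
for g in grad_flat:
    # differentiate one gradient entry w.r.t. the whole parameter tensor
    rows.append(torch.autograd.grad(g, param, retain_graph=True)[0].reshape(-1))
hessian = torch.stack(rows)  # shape: (param.numel(), param.numel()), symmetric
print(hessian.shape)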

PyTorch recently-ish added a functional higher-level API to torch.autograd, which provides torch.autograd.functional.hessian(func, inputs, ...) to directly evaluate the Hessian of the scalar function func with respect to its arguments at a location specified by inputs, a tuple of tensors corresponding to the arguments of func. hessian itself does not support automatic differentiation, I believe.

Note, however, that as of March 2021 it is still in beta.
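
For the use case in the question (the Hessian of a loss with respect to model parameters), here is a minimal sketch of what the call can look like; the toy data, parameter W and loss below are made up for illustration and are not from the question:

import torch

X = torch.randn(8, 4)   # hypothetical toy inputs
y = torch.randn(8, 1)   # hypothetical toy targets
W = torch.randn(4, 1)   # the parameter tensor we differentiate twice w.r.t.

def loss_wrt_param(w):
    # the forward pass rewritten as a function of the parameter alone
    return torch.nn.functional.mse_loss(X @ w, y)

H = torch.autograd.functional.hessian(loss_wrt_param, W)
print(H.shape)  # torch.Size([4, 1, 4, 1]): one entry per pair of elements of W

If the parameters live inside an nn.Module, one common pattern is to write func so that it rebuilds the forward pass from the tensors passed in, since hessian only differentiates with respect to inputs.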


Full example using torch.autograd.functional.hessian to create a score test for a non-zero mean (as a (bad) alternative to the one-sample t-test):

import torch
from torch.autograd import Variable
import torch.distributions as td
from torch.optim import Adam
import scipy.stats


x_data = torch.randn(100)+0.0 # observed data (here sampled under H0)

N = x_data.shape[0] # number of observations

mu_null = torch.zeros(1)
sigma_null_hat = Variable(torch.ones(1), requires_grad=True)

def log_lik(mu, sigma):
  return td.Normal(loc=mu, scale=sigma).log_prob(x_data).sum()

# Find theta_null_hat by some gradient descent algorithm (in this case a closed-form expression would be trivial to obtain (see below)):
opt = Adam([sigma_null_hat], lr=0.01)
for epoch in range(2000):
    opt.zero_grad() # reset gradient accumulator of the optimizer
    loss = - log_lik(mu_null, sigma_null_hat) # compute log likelihood with current value of sigma_null_hat  (= Forward pass)
    loss.backward() # compute gradients (= Backward pass)
    opt.step()      # update sigma_null_hat
    
print(f'parameter fitted under null: sigma: {sigma_null_hat}, expected: {torch.sqrt((x_data**2).mean())}')
#> parameter fitted under null: sigma: tensor([0.9260], requires_grad=True), expected: 0.9259940385818481

theta_null_hat = (mu_null, sigma_null_hat)

U = torch.tensor(torch.autograd.functional.jacobian(log_lik, theta_null_hat)) # Jacobian (= vector of partial derivatives of log likelihood w.r.t. the parameters (of the full/alternative model)) = score
I = -torch.tensor(torch.autograd.functional.hessian(log_lik, theta_null_hat)) / N # estimate of the Fisher information matrix
S = torch.t(U) @ torch.inverse(I) @ U / N # test statistic, often named "LM" (as in Lagrange multiplier), would be zero at the maximum likelihood estimate

pval_score_test = 1 - scipy.stats.chi2(df = 1).cdf(S) # S asymptotically follows a chi^2 distribution with degrees of freedom equal to the number of parameters fixed under H0
print(f'p-value Chi^2-based score test: {pval_score_test}')
#> p-value Chi^2-based score test: 0.9203232752568568

# comparison with Student's t-test:
pval_t_test = scipy.stats.ttest_1samp(x_data, popmean = 0).pvalue
print(f'p-value Student\'s t-test: {pval_t_test}')
#> p-value Student's t-test: 0.9209265268946605