I am trying to replicate a solution for GP regression from the sklearn implementation with a GPyTorch version. Unfortunately, I cannot give an example with the original dataset, which is proprietary. The sklearn solution consistently achieves a roughly 40% lower test MSE than the PyTorch version, but I am also not very familiar with torch. There is another question that deals with a similar problem, but it seems to be about a particular issue with multi-output GP regression, which does not apply here.
I tried to replicate the problem with an artificial dataset that is quite different from the original data:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import numpy as np
import matplotlib.pyplot as plt
train_size = 1000
test_size = 200
X_train = np.random.normal(loc=0, scale=5, size=(train_size, 6))
y_train = X_train[:, 0] + 2*X_train[:, 1] - 3*X_train[:, 2] + 4*X_train[:, 3] - 5*X_train[:, 4] + 6*X_train[:, 5] + np.random.normal(loc=0, scale=0.3, size=train_size)
X_test = np.random.normal(loc=0, scale=5, size=(test_size, 6))
y_test = X_test[:, 0] + 2*X_test[:, 1] - 3*X_test[:, 2] + 4*X_test[:, 3] - 5*X_test[:, 4] + 6*X_test[:, 5] + np.random.normal(loc=0, scale=0.3, size=test_size)
naive_prediction = y_train.mean()
basic_mse = mse(y_test, np.repeat(naive_prediction, test_size))
# Sklearn GP regression
import sklearn.gaussian_process as gp
kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))
model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=False)
model.fit(X_train, y_train)
params = model.kernel_.get_params()
sk_pred, std = model.predict(X_test, return_std=True)
sk_mse = mse(sk_pred, y_test)
print('Sklearn MSE reduction : ', sk_mse/basic_mse)
print('Sklearn Lengthscale: ', params['k2__length_scale'])
# GPyTorch regression
import torch
import gpytorch
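# Note: torch.Tensor(...) casts the float64 NumPy arrays down to float32, while sklearn fits in float64 throughout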
X_train_tensor = torch.Tensor(X_train)
y_train_tensor = torch.Tensor(y_train)
X_test_tensor = torch.Tensor(X_test)
y_test_tensor = torch.Tensor(y_test)
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(X_train_tensor, y_train_tensor, likelihood)
training_iter = 1000
model.train()
likelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # includes GaussianLikelihood parameters
], lr=0.1)
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
all_losses = []
for i in range(training_iter):
    # Zero gradients from the previous iteration
    optimizer.zero_grad()
    # Output from the model
    output = model(X_train_tensor)
    # Compute the loss and backpropagate gradients
    loss = -mll(output, y_train_tensor)
    loss.backward()
    all_losses.append(loss.item())
    optimizer.step()
# Plot the training loss curve
plt.plot(all_losses)
plt.show()
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(model(X_test_tensor))
    mean = prediction.mean
    # Get lower and upper predictive bounds
    lower, upper = prediction.confidence_region()
noise = model.likelihood.noise.item()
length_scale = model.covar_module.base_kernel.lengthscale.item()
torch_mse = mse(mean.numpy(), y_test)
print('Torch MSE reduction : ', torch_mse/basic_mse)
print('Torch Lengthscale: ', length_scale)
print('Torch Noise: ', noise)
Incidentally, in this example GPyTorch performs better. More specifically, I am trying to understand the difference in the noise variance. Through experimentation I found that the alpha parameter, which essentially sets the noise level on the diagonal of the covariance matrix, has a drastic impact on the test MSE. If I understand it correctly, this is the parameter that GPyTorch infers via backprop. Apart from that, the two kernels seem identical, provided the GPyTorch variant is given enough iterations to relax into a good solution.
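To make the comparison concrete, this is how I line up the fitted hyperparameters of the two models at the end of the script above. The mapping between sklearn's ConstantKernel * RBF plus alpha and GPyTorch's ScaleKernel(RBFKernel()) plus likelihood noise is my own reading, so take it with a grain of salt:
# Assumed correspondence between the two parameterizations:
#   sklearn:  k(x, x') = constant_value * exp(-||x - x'||^2 / (2 * length_scale^2)), alpha added to the diagonal
#   gpytorch: k(x, x') = outputscale * exp(-||x - x'||^2 / (2 * lengthscale^2)), likelihood noise added to the diagonal
print('Outputscale - sklearn:', params['k1__constant_value'], ' torch:', model.covar_module.outputscale.item())
print('Lengthscale - sklearn:', params['k2__length_scale'], ' torch:', length_scale)
print('Noise       - sklearn (alpha, fixed):', 0.1, ' torch (learned):', noise)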
My questions are:
How can I set and fix the noise variance in the GPyTorch model? (My untested attempt is sketched at the end of this post.)
The inferred length scale seems to be quite different between the two implementations, especially on the proprietary dataset of the original problem. How can that be explained?
Are there any other factors that could explain a difference in performance? What aspects of the data would amplify such a difference?
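For completeness, this is what I would try for fixing the noise (question 1). Both variants are based on my reading of the GPyTorch docs and are untested assumptions on my part:
# Variant 1 (assumption): keep the standard likelihood but freeze the noise at sklearn's alpha
likelihood = gpytorch.likelihoods.GaussianLikelihood()
likelihood.noise = torch.tensor(0.1)  # same value as alpha in the sklearn model
likelihood.noise_covar.raw_noise.requires_grad_(False)  # keep Adam from updating it

# Variant 2 (assumption): fixed, per-observation noise
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=0.1 * torch.ones(train_size),
    learn_additional_noise=False,
)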