I am trying to replicate a solution for GP regression from the sklearn implementation with a GPyTorch version. Unfortunately, I cannot give an example with the original dataset, which is proprietary. The sklearn solution consistently achieves a roughly 40% lower test MSE than the PyTorch version, but I am also not very familiar with torch. There is another question that deals with a similar problem, but it seems to be about a particular issue with multi-output GP regression, which does not apply here.
I tried to replicate the problem with an artificial dataset that is quite different from the original data:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import numpy as np
import matplotlib.pyplot as plt
train_size = 1000
test_size = 200
X_train = np.random.normal(loc=0, scale=5, size=(train_size, 6))
y_train = X_train[:, 0] + 2*X_train[:, 1] - 3*X_train[:, 2] + 4*X_train[:, 3] - 5*X_train[:, 4] + 6*X_train[:, 5] + np.random.normal(loc=0, scale=0.3, size=train_size)
X_test = np.random.normal(loc=0, scale=5, size=(test_size, 6))
y_test = X_test[:, 0] + 2*X_test[:, 1] - 3*X_test[:, 2] + 4*X_test[:, 3] - 5*X_test[:, 4] + 6*X_test[:, 5] + np.random.normal(loc=0, scale=0.3, size=test_size)
naive_prediction = y_train.mean()
basic_mse = mse(y_test, np.repeat(naive_prediction, test_size))
# Sklearn GP regression
import sklearn.gaussian_process as gp
kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))
model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=False)
model.fit(X_train, y_train)
params = model.kernel_.get_params()
sk_pred, std = model.predict(X_test, return_std=True)
sk_mse = mse(sk_pred, y_test)
print('Sklearn MSE reduction : ', sk_mse/basic_mse)
print('Sklearn Lengthscale: ', params['k2__length_scale'])
# GPyTorch regression
import torch
import gpytorch
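# Note: torch.Tensor(...) casts the float64 NumPy arrays down to float32, while sklearn fits in float64 throughout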
X_train_tensor = torch.Tensor(X_train)
y_train_tensor = torch.Tensor(y_train)
X_test_tensor = torch.Tensor(X_test)
y_test_tensor = torch.Tensor(y_test)
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(X_train_tensor, y_train_tensor, likelihood)
training_iter = 1000
model.train()
likelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # includes GaussianLikelihood parameters
], lr=0.1)
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
all_losses = []
for i in range(training_iter):
    # Zero gradients from the previous iteration
    optimizer.zero_grad()
    # Output from the model
    output = model(X_train_tensor)
    # Compute the loss and backpropagate gradients
    loss = -mll(output, y_train_tensor)
    loss.backward()
    all_losses.append(loss.item())
    optimizer.step()
# Plot the training loss curve
plt.plot(all_losses)
plt.show()
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(model(X_test_tensor))
    mean = prediction.mean
    # Get lower and upper predictive bounds
    lower, upper = prediction.confidence_region()
noise = model.likelihood.noise.item()
length_scale = model.covar_module.base_kernel.lengthscale.item()
torch_mse = mse(mean.numpy(), y_test)
print('Torch MSE reduction : ', torch_mse/basic_mse)
print('Torch Lengthscale: ', length_scale)
print('Torch Noise: ', noise)
Incidentally, in this example GPyTorch performs better. More specifically, I am trying to understand the difference in the noise variance. Through experimentation I found that the alpha parameter, which essentially sets the noise level on the diagonal of the covariance matrix, has a drastic impact on the test MSE. If I understand it correctly, this is the parameter that GPyTorch infers via backprop. Apart from that, the two kernels seem identical, provided the GPyTorch variant is given enough iterations to relax into a good solution.
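To make the comparison concrete, this is how I line up the fitted hyperparameters of the two models at the end of the script above. The mapping between sklearn's ConstantKernel * RBF plus alpha and GPyTorch's ScaleKernel(RBFKernel()) plus likelihood noise is my own reading, so take it with a grain of salt:
# Assumed correspondence between the two parameterizations:
#   sklearn:  k(x, x') = constant_value * exp(-||x - x'||^2 / (2 * length_scale^2)), alpha added to the diagonal
#   gpytorch: k(x, x') = outputscale * exp(-||x - x'||^2 / (2 * lengthscale^2)), likelihood noise added to the diagonal
print('Outputscale - sklearn:', params['k1__constant_value'], ' torch:', model.covar_module.outputscale.item())
print('Lengthscale - sklearn:', params['k2__length_scale'], ' torch:', length_scale)
print('Noise       - sklearn (alpha, fixed):', 0.1, ' torch (learned):', noise)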
My questions are:
How can I set and fix the noise variance in the GPyTorch model? (My untested attempt is sketched at the end of this post.)
The inferred length scale seems to be quite different between the two implementations, especially on the proprietary dataset of the original problem. How can that be explained?
Are there any other factors that could explain a difference in performance? What aspects of the data would amplify such a difference?
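For completeness, this is what I would try for fixing the noise (question 1). Both variants are based on my reading of the GPyTorch docs and are untested assumptions on my part:
# Variant 1 (assumption): keep the standard likelihood but freeze the noise at sklearn's alpha
likelihood = gpytorch.likelihoods.GaussianLikelihood()
likelihood.noise = torch.tensor(0.1)  # same value as alpha in the sklearn model
likelihood.noise_covar.raw_noise.requires_grad_(False)  # keep Adam from updating it

# Variant 2 (assumption): fixed, per-observation noise
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=0.1 * torch.ones(train_size),
    learn_additional_noise=False,
)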