Train score diminishes after polynomial regression degree increases


I'm trying to use linear regression to fit a polynomial to a set of points sampled from a sinusoidal signal with some noise added, using linear_model.LinearRegression from sklearn.

As expected, the training and validation scores increase as the degree of the polynomial increases, but somewhere around degree 20 things start getting weird: the scores start going down, and the model returns polynomials that don't look anything like the data used to train it.

Below are some plots where this can be seen, as well as the code that generated both the regression models and the plots:

Here is how it works well up to degree=17. Original data vs. predictions:

After that it just gets worse:

Validation curve, increasing the degree of the polynomial:

from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve

def make_data(N, err=0.1, rseed=1):
    # Noisy sinusoid: y = sin(x) plus Gaussian noise with scale err
    rng = np.random.RandomState(rseed)
    x = 10 * rng.rand(N)
    X = x[:, None]
    y = np.sin(x)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())


X, y = make_data(400)

X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)

plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data vs. predicted values for different degrees')
plt.legend(loc='best');


degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)

plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o', 
         color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
         color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Validation curve, increasing the degree of the polynomial')
plt.xlabel('degree')
plt.ylabel('score');

I know the expected thing is that the validation score goes down when the complexity of the model increases, but why does the training score go down as well? What am I missing here?


There are 2 best solutions below


The training score is expected to go down as well, due to the model overfitting the training data. The validation error goes down because the sine function has a Taylor series expansion, so as you increase the degree of the polynomial, the model can fit the sine curve better and better.

In the ideal scenario, where the target is not a function that expands to infinitely many degrees, you see the training error go down (not monotonically, but in general) and the validation error go up after some degree (high for low degrees -> lowest at some higher degree -> increasing after that).
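
For intuition about the Taylor-series point, here is a small sketch of my own (not part of the original post) that checks how well truncated Taylor polynomials of sin(x) track the true function over the same [0, 10] interval; the truncation lengths are chosen arbitrarily:

import numpy as np
from math import factorial

x = np.linspace(0, 10, 500)
for n_terms in (5, 10, 15):
    # Truncated Taylor series around 0: sin(x) ~ sum_k (-1)^k x^(2k+1) / (2k+1)!
    # n_terms terms give a polynomial of degree 2*n_terms - 1.
    approx = sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
                 for k in range(n_terms))
    print(n_terms, np.max(np.abs(approx - np.sin(x))))

The maximum error on [0, 10] drops sharply as the degree grows, which is the sense in which a higher-degree polynomial can fit the sine curve better.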


First of all, here is how you can fix it, by setting the model's normalize flag to True:

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(normalize=True))
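
Note that the normalize flag was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a recent version, a comparable fix (my suggestion, not the original answer's code) is to scale the polynomial features explicitly in the pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

def PolynomialRegression(degree=4):
    # StandardScaler plays a similar role to the removed normalize=True flag:
    # it puts all polynomial features on a comparable scale before the fit.
    return make_pipeline(PolynomialFeatures(degree),
                         StandardScaler(),
                         LinearRegression())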

But why? In linear regression, fit() finds the best-fitting model using the Moore–Penrose pseudoinverse, which is a common way to compute the least-squares solution. When you add polynomial features, the augmented feature values become very large very quickly if you do not normalize. These large values dominate the least-squares cost and push the model toward fitting the large higher-order feature values instead of the data.
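
To make the scale problem concrete, here is a quick sketch of my own (using the same x-range as the question; the degrees are chosen arbitrarily) showing how fast the raw polynomial feature values blow up:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
X = 10 * rng.rand(400)[:, None]

for degree in (5, 10, 20):
    feats = PolynomialFeatures(degree).fit_transform(X)
    # The largest column is roughly 10**degree, so by degree 20 the columns of
    # the design matrix span ~20 orders of magnitude and the least-squares
    # problem becomes badly conditioned.
    print(degree, feats.max())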

Now the plots look better and the way they are supposed to be.