How to quantify if model predictions are close to expected values in Python?


I have two sets of data where X holds the observed values and Y the expected values, and I am trying to quantify the goodness of fit with Python. It is common for people to calculate R^2 for the datasets and decide which is better based on those values, which is wrong. I want measures that help me decide which dataset has observed values closest to the expected values. I tried the chi-square test with Python, but are there any other tests that can help decide which dataset fits best?

Code

from scipy.stats import chisquare
import numpy as np

x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57, 
        97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
        93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


x2 = np.array([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46, 
        94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
y2 = np.array([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30, 
        94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])

print(chisquare(x1, y1))
print(chisquare(x2, y2))

Update

from scipy.stats import chisquare
from sklearn.metrics import r2_score
from scipy import stats
import numpy as np

x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57, 
        97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
        93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


x2 = np.array([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46, 
        94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
y2 = np.array([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30, 
        94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


print "Scikit R2, 1:", r2_score(y1, x1)
print "Scikit R2, 2:", r2_score(y2, x2)


slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(y1, x1)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(y2, x2)


print "Stats R2, 1:", r_value1**2
print "Stats R2, 2", r_value2**2

With the updated code, I obtain the following output:

Scikit R2, 1: 0.820091025592
Scikit R2, 2: 0.928643087517
Stats R2, 1: 0.958813342741
Stats R2, 2: 0.965013525387

Why do the R^2 values obtained from scikit-learn and scipy differ?

Answer

The two functions you list (scipy.stats.linregress and sklearn.metrics.r2_score) do different things.

sklearn.metrics.r2_score

sklearn.metrics.r2_score does what you are looking for: it takes two sets of data and computes the R^2 (coefficient of determination) between them. From the docs:

sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None, multioutput=None)

Parameters:

y_true : array-like of shape = (n_samples) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs)

Estimated target values.

So your observed data (x1, x2) play the role of y_true, and your expected values (y1, y2) are y_pred. This is therefore the correct way to call it:

r2_score(x1, y1)
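
As a quick illustration with made-up toy arrays (not the question's data), note that the argument order matters: r2_score is not symmetric, since the score is normalized by the variance of y_true.

import numpy as np
from sklearn.metrics import r2_score

observed = np.array([3.1, 4.9, 7.2, 9.0])  # hypothetical observed data (y_true)
expected = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical expected values (y_pred)

print(r2_score(observed, expected))  # observed as ground truth
print(r2_score(expected, observed))  # swapped arguments give a different score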

scipy.stats.linregress

scipy.stats.linregress does not do what you are looking for. Its purpose is to perform a linear regression, i.e. to fit a line to two sets of data, not to compare a set of data with its predicted values. The r_value it returns (which you can square to get R^2) is the correlation coefficient between the y values you feed it and the predicted values from the regression (fit) it performs. Since you already know your predicted values, this is not the function you are looking for.
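
To see concretely why the two R^2 values in the update differ, here is a minimal sketch using the question's first dataset: linregress fits a fresh line before reporting r_value, so squaring it tells you how well that refitted line explains the data, while r2_score compares the two sets as they are.

import numpy as np
from scipy import stats
from sklearn.metrics import r2_score

# The question's first dataset: x1 observed, y1 expected.
x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57,
               97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29,
               90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
               93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80,
               83.00, 73.50, 76.70, 82.60])

# Direct agreement: compares the two sets as-is, with no fitting step.
print("r2_score:", r2_score(x1, y1))

# linregress first fits x1 ~ slope * y1 + intercept; r_value is the
# Pearson correlation of its inputs, so r_value**2 is the R^2 of that
# freshly fitted line, which absorbs any offset or scale mismatch.
slope, intercept, r_value, p_value, std_err = stats.linregress(y1, x1)
print("linregress r^2:", r_value ** 2)

# The two numbers agree only when the fitted line is close to the
# identity line (slope ~ 1, intercept ~ 0).
print("slope:", slope, "intercept:", intercept)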