I have two sets of data, where X are the observed and Y the expected values. I am trying to quantify the goodness of fit with Python. It is very common for people to calculate the R^2 of the datasets and decide which is better based on those values, which is wrong. I want values that help me decide which dataset has observed values closest to the expected values. I tried chi-square tests with Python, but are there any other tests that can help decide which dataset has the best fit?
Code
from scipy.stats import chisquare
import numpy as np

# Observed values for the first dataset
x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57,
               97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
# Expected values for the first dataset
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
               93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])
# Observed values for the second dataset
x2 = np.array([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46,
               94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
# Expected values for the second dataset
y2 = np.array([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30,
               94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])

# chisquare(f_obs, f_exp): chi-square test of observed against expected frequencies
print(chisquare(x1, y1))
print(chisquare(x2, y2))
Update
from scipy.stats import chisquare
from sklearn.metrics import r2_score
from scipy import stats
import numpy as np

x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57,
               97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
               93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])
x2 = np.array([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46,
               94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
y2 = np.array([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30,
               94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])

# R^2 from scikit-learn: scores y_pred against y_true directly
print("Scikit R2, 1:", r2_score(y1, x1))
print("Scikit R2, 2:", r2_score(y2, x2))

# R^2 from SciPy: square of the correlation coefficient of a fitted regression
slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(y1, x1)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(y2, x2)
print("Stats R2, 1:", r_value1**2)
print("Stats R2, 2:", r_value2**2)
With the updated code, the following output is obtained:

Scikit R2, 1: 0.820091025592
Scikit R2, 2: 0.928643087517
Stats R2, 1: 0.958813342741
Stats R2, 2: 0.965013525387
Why do the R^2 values obtained from scikit-learn and SciPy differ?
The two functions you list (scipy.stats.linregress and sklearn.metrics.r2_score) do different things.

sklearn.metrics.r2_score

sklearn.metrics.r2_score does what you are looking for: it takes two sets of data and computes the R^2 (coefficient of determination) between those two sets. Your observed data (x1, x2) are your y_true, and your expected values (y1, y2) are your y_pred. So this is the correct way to call it:
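A minimal sketch of that call, reusing the arrays defined in the question:

# y_true comes first (observed data), y_pred second (expected values)
print(r2_score(x1, y1))
print(r2_score(x2, y2))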
scipy.stats.linregress
scipy.stats.linregress does not do what you are looking for. Its purpose is to perform a linear regression, i.e. to find a linear fit between two sets of data (not between a set of data and its predicted values). The r_value it returns (which you can square to get R^2) is the correlation coefficient between the y values you feed it and the predicted values from the regression (fitting) it performs. Since you already know your predicted values, this is not the function you are looking for.
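To see the difference concretely, here is a small check (a sketch, assuming the arrays and imports from the question are in scope): r_value**2 from linregress measures R^2 against the regression's own fitted line, not against your expected values.

# Fit x1 as a linear function of y1; r_value is their correlation coefficient
slope, intercept, r_value, p_value, std_err = stats.linregress(y1, x1)

# Predictions of the fitted line -- not your known expected values
fitted = intercept + slope * y1

# r_value**2 matches the R^2 of the line scored on its own predictions...
print(r_value**2, r2_score(x1, fitted))

# ...whereas the score you want compares observed to expected directly
print(r2_score(x1, y1))

This helps explain why the "Stats R2" numbers in the question come out higher than the "Scikit R2" ones: the regression gets to optimize its fit before being scored.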