I have 2 numpy arrays like so:
import numpy as np

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49,
28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
33.081223, 30.312504])
When I calculate the R-squared using SciKit Learn I get a completely different value than when I calculate Pearson's Correlation and then square the result:
import sklearn.metrics
import scipy.stats

sk_r2 = sklearn.metrics.r2_score(a, b)
print('SciKit R2: {:0.5f}\n'.format(sk_r2))
pearson_r = scipy.stats.pearsonr(a, b)
print('Pearson R: ', pearson_r)
print('Pearson R squared: ', pearson_r[0]**2)
Results in:
SciKit R2: 0.15913
Pearson R: (0.7617075766854164, 9.534162339384296e-05)
Pearson R squared: 0.5801984323799696
I realize that an R-squared value can sometimes be negative for a poorly fitting model (https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative), and therefore the square of Pearson's correlation is not always equal to R-squared. However, I thought that a positive R-squared value was always equal to Pearson's correlation squared. Why are these two R-squared values so different?
The Pearson correlation coefficient r and the R2 coefficient of determination are two completely different statistics.
You can take a look at https://en.wikipedia.org/wiki/Pearson_correlation_coefficient and https://en.wikipedia.org/wiki/Coefficient_of_determination
Update
Pearson's r coefficient is a measure of linear correlation between two variables and is

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the means of the samples.

The R2 coefficient of determination is a measure of goodness of fit and is

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y}_i is the predicted value of y_i and \bar{y} is the mean of the sample.

Thus r**2 is not equal to R2 because their formulas are totally different.
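As a quick numerical check (a sketch I'm adding here, not part of the original answer), you can apply both formulas directly to the arrays a and b from the question and reproduce the two different values:

import numpy as np

# a, b: the two arrays defined in the question
num = np.sum((a - a.mean()) * (b - b.mean()))
den = np.sqrt(np.sum((a - a.mean())**2)) * np.sqrt(np.sum((b - b.mean())**2))
r = num / den
print(r**2)  # ~0.58020, matches pearson_r[0]**2 from the question

# r2_score(a, b) treats b as a prediction of a, i.e. hat_y = b:
R2 = 1 - np.sum((a - b)**2) / np.sum((a - a.mean())**2)
print(R2)    # ~0.15913, matches sk_r2 from the question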
Update 2

r**2 is equal to R2 only in the case that you calculate r with a variable (say y) and the predicted value \hat{y} of that variable from a linear model. Let's make an example using the two arrays you provided.
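The original example code is missing from the post, so this is a minimal reconstruction; the DataFrame name df and its columns x and y are assumptions taken from the last paragraph below:

import pandas as pd

# a, b: the arrays from the question; x is the independent variable,
# y the dependent one
df = pd.DataFrame({'x': a, 'y': b})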
Now we fit a linear regression model.
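The fitting code and its printed output were lost from the post; here is a sketch using sklearn.linear_model.LinearRegression (statsmodels or scipy.stats.linregress would work just as well):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['x']], df.y)               # fit y ~ intercept + slope * x
y_pred = model.predict(df[['x']])        # hat_y, the predicted values of y
print(model.intercept_, model.coef_[0])  # fitted intercept and slope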
Then we compute both r**2 and R2, and we can see that in this case they're equal.
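Again, the original snippet is missing; this sketch reuses y_pred from the fitted model above:

from scipy.stats import pearsonr
from sklearn.metrics import r2_score

r, _ = pearsonr(df.y, y_pred)
print('r**2:', r**2)
print('R2  :', r2_score(df.y, y_pred))
# For a least-squares linear model with an intercept, the squared Pearson
# correlation between y and hat_y equals R2, so both lines print the
# same number.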
You did r2_score(df.x, df.y), i.e. r2_score(a, b). That can't be equal to our computed values, because you used a measure of goodness of fit between the independent variable x and the dependent variable y. We instead used both r and R2 with y and the predicted value of y.
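To make that contrast concrete (again a sketch building on the assumed df and y_pred above):

from sklearn.metrics import r2_score

print(r2_score(df.x, df.y))    # ~0.15913: the value computed in the question
print(r2_score(df.y, y_pred))  # R2 of the fitted model: equals r**2 above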