SciKit Learn R-squared is very different from square of Pearson's Correlation R

Question

SciKit Learn R-squared is very different from square of Pearson's Correlation R

6k Views Asked by daragh At 21 January 2021 at 11:04

I have 2 numpy arrays ike so:

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49, 
              28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])

b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
              28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
              30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
              33.081223, 30.312504])

When I calculate the R-squared using SciKit Learn I get a completely different value than when I calculate Pearson's Correlation and then square the result:

sk_r2 = sklearn.metrics.r2_score(a, b)
print('SciKit R2: {:0.5f}\n'.format(sk_r2))

pearson_r = scipy.stats.pearsonr(a, b)
print('Pearson R: ', pearson_r)
print('Pearson R squared: ', pearson_r[0]**2)

Results in:
SciKit R2: 0.15913

Pearson R: (0.7617075766854164, 9.534162339384296e-05)
Pearson R squared: 0.5801984323799696

I realize that an R-squared value can sometimes be negative for a poorly fitting model (https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative) and therefore the square of Pearson's correlation is not always equal to R-squared. However, I thought that for a positive R-squared value it was always equal to Pearson's correlation squared? How are these R-squared values so different?

Original Q&A

There are 2 best solutions below

Alireza Amani On 04 February 2021 at 21:34

I was in the same situation too. For me it happened when I compared R-squared in scikit-learn with R-squared as it is calculated by R caret package.

The R-squared in R caret package, or in your case in scipy.stats.pearsonr is the square of "Pearson R" by the definition. A measure of correlation. See its definition here(by definition could be between zero and 1).

However, the R-squared in scikit-learn is a measure of accuracy, you can look at its definition in its user guide.(by definition could be between -Inf and 1).

Bottom line, don't compare them. They are different measures.

**Max Pierini** · Accepted Answer · 2021-03-27T16:24:55.017000

Pearson correlation coefficient R and R-squared coefficient of determination are two completely different statistics.

You can take a look at https://en.wikipedia.org/wiki/Pearson_correlation_coefficient and https://en.wikipedia.org/wiki/Coefficient_of_determination

update

Persons's r coefficient is a measure of linear correlation between two variables and is

where bar x and bar y are the means of the samples.

R2 coefficient of determination is a measure of goodness of fit and is

where hat y is the predicted value of y and bar y is the mean of the sample.

Thus

they measure different things
r**2 is not equal to R2 because their formula are totally different

update 2

r**2 is equal to R2 only in the case that you calculate r with a variable (say y) and the predicted variable hat y from a linear model

Let's make an example using the two arrays you provided

import numpy as np
import pandas as pd
import scipy.stats as sps
import statsmodels.api as sm
from sklearn.metrics import r2_score as R2
import matplotlib.pyplot as plt

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49, 
              28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])

b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
              28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
              30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
              33.081223, 30.312504])

df = pd.DataFrame({
    'x': a,
    'y': b,
})

df.plot(x='x', y='y', marker='.', ls='none', legend=False);

now we fit a linear regression model

mod = sm.OLS.from_formula('y ~ x', data=df)
mod_fit = mod.fit()
print(mod_fit.summary())

output

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.580
Model:                            OLS   Adj. R-squared:                  0.557
Method:                 Least Squares   F-statistic:                     24.88
Date:                Mon, 29 Mar 2021   Prob (F-statistic):           9.53e-05
Time:                        14:12:15   Log-Likelihood:                -36.562
No. Observations:                  20   AIC:                             77.12
Df Residuals:                      18   BIC:                             79.12
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     16.0814      2.689      5.979      0.000      10.431      21.732
x              0.4157      0.083      4.988      0.000       0.241       0.591
==============================================================================
Omnibus:                        6.882   Durbin-Watson:                   3.001
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.363
Skew:                           0.872   Prob(JB):                        0.113
Kurtosis:                       4.481   Cond. No.                         245.
==============================================================================

and compute both r**2 and R2 and we can see that in this case they're equal

predicted_y = mod_fit.predict(df.x)
print("R2 :", R2(df.y, predicted_y))
print("r^2:", sps.pearsonr(df.y, predicted_y)[0]**2)

output

R2 : 0.5801984323799696
r^2: 0.5801984323799696

You did R2(df.x, df.y) that can't be equal to our computed values because you used a measure of goodness-of-fit between independent x and dependent y variables. We instead used both r and R2 with y and predicted value of y.

SciKit Learn R-squared is very different from square of Pearson's Correlation R

There are 2 best solutions below

update

update 2

Related Questions in PYTHON

Related Questions in SCIKIT-LEARN

Related Questions in PEARSON-CORRELATION

Related Questions in SCIPY.STATS

Related Questions in COEFFICIENT-OF-DETERMINATION

Trending Questions

Popular # Hahtags

Popular Questions