Root Mean Squared Error - Calculation Discrepancies in Python

I get slightly different values when I compute the root mean squared error (RMSE) of the same regression in different ways. What explains these discrepancies? My guesses are (1) rounding or (2) statistical methodology (e.g., sample vs. population).

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

# Fit iprod ~ const + duration on the statsmodels "strikes" dataset
data = sm.datasets.strikes.load_pandas()
X = data.data['duration']
y = data.data['iprod']
X = add_constant(X)
model = sm.OLS(y, X)
results = model.fit()

# Six ways of computing the RMSE of the fit
a = np.sqrt(results.mse_resid)
b = np.sqrt(np.dot(results.resid, results.resid) / len(results.resid))
c = np.sqrt(np.square(results.resid).mean())
d = np.sqrt(1 - results.rsquared_adj)*y.std()
# mean_squared_error takes (y_true, y_pred); the value is the same either way
e = np.sqrt(mean_squared_error(y, results.fittedvalues))
f = np.sqrt( (np.linalg.norm(results.fittedvalues - y)**2)/len(y) )
print("\n a = ", a, "\n b = ", b, "\n c = ", c, "\n d = ", d, "\n e = ", e, "\n f = ", f)

Results

 a =  0.043831898071428385 
 b =  0.043119136780037336 
 c =  0.043119136780037336 
 d =  0.043831898071428385 
 e =  0.043119136780037336 
 f =  0.043119136780037336

Answer by Misha

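The discrepancies come from the degrees-of-freedom adjustment for the number of estimated parameters in the regression model (k, which here counts the constant as well as the slope). a and d divide the residual sum of squares by n - k, while b, c, e, and f divide it by n. d agrees with a because results.rsquared_adj applies the same n - k correction and pandas' y.std() is the sample standard deviation (ddof=1), so the n - 1 factors cancel. This is closer to guess (2) in the question than to rounding.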
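As a rough illustration of that point, here is a minimal NumPy-only sketch. It does not use the strikes data: the synthetic data, the seed, and the helper name rmse_pair are assumptions made up for this example. It only shows that the two RMSE conventions differ by the factor sqrt(n / (n - k)).

```python
import numpy as np

def rmse_pair(y, X):
    """Hypothetical helper: OLS fit of y on X (X already contains the constant
    column), returning the RMSE with divisor n and with divisor n - k."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape                      # k counts the constant column too
    ssr = float(resid @ resid)          # residual sum of squares
    return np.sqrt(ssr / n), np.sqrt(ssr / (n - k))

# Synthetic data (an assumption for illustration, not the strikes dataset)
rng = np.random.default_rng(0)
n = 62
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), x])    # constant + one regressor, so k = 2

rmse_n, rmse_df = rmse_pair(y, X)
k = X.shape[1]

# The two printed numbers should agree up to floating-point error
print(rmse_df / rmse_n, np.sqrt(n / (n - k)))
```

The adjusted value is always the larger of the two, and the gap shrinks as n grows relative to k.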

# Continues from the question's code (uses results, y, np, mean_squared_error)
n = len(y)                # number of observations
k = results.params.size   # number of estimated parameters, constant included

# Adjusted for k
## From model results that already adjust for k
a = np.sqrt(results.mse_resid)                    # results.mse_resid
d = np.sqrt(1 - results.rsquared_adj) * y.std()   # results.rsquared_adj
## From quantities that are not adjusted for k, adjusting "manually"
### results.resid
b1 = np.sqrt(np.dot(results.resid, results.resid) / (n - k))
c1 = np.sqrt(np.square(results.resid).mean() * (n / (n - k)))
### results.fittedvalues
e1 = np.sqrt(mean_squared_error(y, results.fittedvalues) * (n / (n - k)))
f1 = np.sqrt(np.linalg.norm(results.fittedvalues - y)**2 / (n - k))

# Not adjusted for k
b = np.sqrt(np.dot(results.resid, results.resid) / n)
c = np.sqrt(np.square(results.resid).mean())
e = np.sqrt(mean_squared_error(y, results.fittedvalues))
f = np.sqrt(np.linalg.norm(results.fittedvalues - y)**2 / n)

print("Adjusted for k\n a =\t", a, "\n d =\t", d, "\n b1 =\t", b1,
      "\n c1 =\t", c1, "\n e1 =\t", e1, "\n f1 =\t", f1,
      "\nNot adjusted for k\n b =\t", b, "\n c =\t", c, "\n e =\t", e, "\n f =\t", f)

Results

Adjusted for k
 a =     0.043831898071428385 
 d =     0.043831898071428385 
 b1 =    0.043831898071428385 
 c1 =    0.04383189807142839 
 e1 =    0.04383189807142839 
 f1 =    0.043831898071428385 
Not adjusted for k
 b =     0.043119136780037336 
 c =     0.043119136780037336 
 e =     0.043119136780037336 
 f =     0.043119136780037336
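
The printed values can also be checked against each other without rerunning the model. The short sketch below assumes k = 2 (constant plus one slope) and that the only difference between the two groups is the divisor; under those assumptions the squared ratio of the two RMSEs is n / (n - k), which can be solved for n.

```python
# Values copied from the results above
a = 0.043831898071428385   # adjusted for k (divisor n - k)
b = 0.043119136780037336   # not adjusted for k (divisor n)

k = 2                      # assumption: constant + one slope coefficient

# If only the divisor differs, (a / b)**2 == n / (n - k)
ratio_sq = (a / b) ** 2

# Solve n / (n - k) = ratio_sq for n
n_implied = k * ratio_sq / (ratio_sq - 1)
print(n_implied)
```

The implied n should come out very close to a whole number, which is what one would expect if the divisor is the only source of the gap; differences in the last printed digit (as in c1 and e1) are ordinary floating-point noise.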