I'm experiencing some discrepancies when comparing different calculations of root mean square error (RMSE). What explains these discrepancies? My guesses are (1) rounding or (2) statistical methodology (e.g., sample vs. population).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

data = sm.datasets.strikes.load_pandas()
X = data.data['duration']
y = data.data['iprod']
X = add_constant(X)                       # add an intercept column

model = sm.OLS(y, X)
results = model.fit()

a = np.sqrt(results.mse_resid)                                           # sqrt of statsmodels' residual mean square
b = np.sqrt(np.dot(results.resid, results.resid) / len(results.resid))   # sqrt(SSR / n) via a dot product
c = np.sqrt(np.square(results.resid).mean())                             # sqrt of the mean squared residual
d = np.sqrt(1 - results.rsquared_adj) * y.std()                          # via adjusted R-squared and the sample std of y
e = np.sqrt(mean_squared_error(y, results.fittedvalues))                 # sklearn's MSE (arguments as y_true, y_pred)
f = np.sqrt((np.linalg.norm(results.fittedvalues - y) ** 2) / len(y))    # sqrt(||y - yhat||^2 / n)

print("\n a = ", a, "\n b = ", b, "\n c = ", c, "\n d = ", d, "\n e = ", e, "\n f = ", f)
Results
a = 0.043831898071428385
b = 0.043119136780037336
c = 0.043119136780037336
d = 0.043831898071428385
e = 0.043119136780037336
f = 0.043119136780037336
The discrepancy comes from the adjustment for the number of estimated parameters in the regression model (k), not from rounding. Quantities a and d divide the sum of squared residuals by the residual degrees of freedom n - k (here k = 2: the constant and the slope on duration), which is what statsmodels' mse_resid and the adjusted R-squared are based on. Quantities b, c, e, and f divide the same sum of squared residuals by n, which is why they form the second, slightly smaller group.
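As a quick check (a sketch that only reuses quantities already available on the fitted results object above), computing both versions of the RMSE directly from the sum of squared residuals reproduces the two groups of numbers:

n = results.nobs              # number of observations
ssr = results.ssr             # sum of squared residuals
df = results.df_resid         # n - k, with k = 2 here (constant + slope)
rmse_n  = np.sqrt(ssr / n)    # divides by n     -> matches b, c, e, f
rmse_df = np.sqrt(ssr / df)   # divides by n - k -> matches a, d
print(rmse_n, rmse_df)

The two versions differ exactly by the factor sqrt((n - k) / n), so either one can be recovered from the other.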