Weighted Least Squares in Statsmodels vs. Numpy?

5.8k Views Asked by At

I am trying to replicate the functionality of Statsmodels's weight least squares (WLS) function with Numpy's ordinary least squares (OLS) function (i.e. Numpy refers to OLS as just "least squares").

In other words, I want to compute the WLS in Numpy. I used this Stackoverflow post as reference, but drastically different R² values arise moving from Statsmodel to Numpy.

Take the following example code that replicates this:

import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get ride of hidden intercept of "+ 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 3, 3])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)

# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared      

# Weighted Least Squares from Numpy API
Aw = x.reshape((-1, 1)) * np.sqrt(weight[:, np.newaxis]) # Multiply two column vectors
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())

print("Statsmodels R²: " + str(statsmodel_r2))
print("Numpy R²: " + str(numpy_r2[0]))

After running such code, I get the following results:

Statsmodels R²: 2.220446049250313e-16
Numpy R²: 0.475486515775414

Clearly something is wrong here! Can anyone point out my flaws here? Am I miss understanding the patsy formula?

0

There are 0 best solutions below