I am trying to use an OLS regression to predict missing (NaN) values of ustar from wind speed (WS), the variation of WS by month, and radiation (Rn), fitting the model on the rows where these variables are known. Every variable in the formula has missing data somewhere in the dataframe, but the fitted model shows strong relationships with all of the terms and an R-squared of 0.80, so I believe gap-filling with the regression predictions is feasible. Here is my code:
import pandas as pd
import statsmodels.api as sm

# `data` is my existing DataFrame with a DatetimeIndex
regression_data = pd.DataFrame([])
regression_data['ustar'] = data['ustar']
regression_data['WS'] = data['WS']
regression_data['Rn'] = data['Rn']
regression_data['month'] = data.index.month
formula = "ustar ~ WS + (WS:C(month)) + (WS:Rn) + 1"
regression_model = sm.regression.linear_model.OLS.from_formula(formula, regression_data)
results = regression_model.fit()
# This call raises the AttributeError shown below
predicted_values = results.predict(regression_data)
Traceback (most recent call last):
File "<ipython-input-61-073df0b2ae63>", line 1, in <module>
predicted_values = results.predict(regression_data)
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/statsmodels/base/model.py", line 739, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
I understand that similar issues with this error have been reported before, but I do not know whether the complexity of my formula is what the predict method is failing to handle. Does anyone have a perspective on how I should approach this problem?
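For reference, here is a minimal sketch of the gap-filling I am aiming for, written so that the model never sees a NaN. It rests on the assumption (not confirmed) that the error is related to rows with missing values being dropped during the formula fit, rather than to the formula itself. The synthetic `data` frame, its coefficients, and the names `complete`, `fillable`, and `gap_filled` are hypothetical stand-ins for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `data` (hypothetical values, illustration only).
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=730, freq="D")
data = pd.DataFrame(
    {
        "WS": rng.uniform(0.5, 10.0, len(index)),
        "Rn": rng.uniform(-50.0, 600.0, len(index)),
    },
    index=index,
)
data["ustar"] = (
    0.05
    + 0.08 * data["WS"]
    + 1e-5 * data["WS"] * data["Rn"]
    + rng.normal(0.0, 0.05, len(index))
)
# Introduce gaps in every variable, as in the real dataset.
data.loc[data.index[::7], "ustar"] = np.nan
data.loc[data.index[::11], "WS"] = np.nan
data.loc[data.index[::13], "Rn"] = np.nan

regression_data = data[["ustar", "WS", "Rn"]].copy()
regression_data["month"] = regression_data.index.month

formula = "ustar ~ WS + WS:C(month) + WS:Rn + 1"

# Fit only on rows where the response and all predictors are present,
# so no rows have to be dropped inside the formula machinery.
complete = regression_data.dropna(subset=["ustar", "WS", "Rn"])
results = smf.ols(formula, data=complete).fit()

# Predict only where ustar is missing but WS and Rn are available.
fillable = (
    regression_data["ustar"].isna()
    & regression_data[["WS", "Rn"]].notna().all(axis=1)
)
predicted = results.predict(regression_data.loc[fillable])

# Write the predictions into a copy; observed values stay untouched.
gap_filled = regression_data["ustar"].copy()
gap_filled.loc[fillable] = np.asarray(predicted)
```

Rows where WS or Rn is itself missing stay NaN in `gap_filled`, since the model has nothing to predict from there. This also assumes every month that needs filling appears in the rows used for fitting, otherwise `C(month)` would meet an unseen level at prediction time.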