I am trying to perform multiple linear regression using the statsmodels.formula.api package in python and have listed the code that i have used to perform this regression below.
auto_1= pd.read_csv("Auto.csv")
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())
The data consists the following variables - mpg, cylinders, displacement, horsepower, weight , acceleration, year, origin and name. When the print result comes up, it shows multiple rows of the horsepower column and the regression results are also not correct. Im not sure why?
It's likely because of the data type of the
horsepowercolumn. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data type (runauto_1.dtypes) and cast the column to a numeric type (it's best to do it when you are first reading the csv file with thedtype=parameter of theread_csv()method.Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
Output (dummy coding):
Now, converting the strings back to integers:
Output (as expected):