Using pandas pd.cut to generate a categorical variable with statsmodels

I have tried to use pd.cut to create a categorical variable from a continuous variable. I'd like to use it as a dummy variable in a subsequent statsmodels regression. When I fit a model with a categorical variable created this way, I get an error:

TypeError: data type not understood.

A test case is included below.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.DataFrame(np.random.randn(6,4))
df.columns = ['A', 'B', 'C', 'D']
df['ttt']=pd.cut(df['D'], bins=2)
test = smf.ols('A ~ B + ttt', data=df).fit()

I'm sure I've done something obviously wrong. Any help would be appreciated.

Best answer

I'm not sure how far statsmodels has come in supporting the new Categorical type in pandas. For the moment, you may have to convert the categorical back to an object type for it to work (please check that the resulting OLS fit is sensible; I don't know the full details of what you're trying to do):

# Convert the Categorical column to plain object dtype before using it in the formula
df['ttt_fixed'] = df.ttt.astype(object)
test = smf.ols('A ~ B + ttt_fixed', data=df).fit()
test.summary()
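
To check what the conversion above actually does to the column, a minimal sketch like the following may help. It uses only pandas and NumPy. The seeded random generator and the `labels=['low', 'high']` argument are assumptions added for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (an assumption, not in the question)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=['A', 'B', 'C', 'D'])

# pd.cut returns a Categorical column; labels= (hypothetical names here)
# gives the two bins readable level names instead of interval labels
df['ttt'] = pd.cut(df['D'], bins=2, labels=['low', 'high'])
print(df['ttt'].dtype)        # category

# Converting to object dtype leaves a plain column of Python strings
df['ttt_fixed'] = df['ttt'].astype(object)
print(df['ttt_fixed'].dtype)  # object
print(df['ttt_fixed'].value_counts())
```

The `ttt_fixed` column can then be used in the formula `'A ~ B + ttt_fixed'` exactly as in the answer. Whether the plain `ttt` column also works directly probably depends on the installed statsmodels and patsy versions.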