Cross-validating an OLS regression model on a dataframe

I have a dataframe like this (the real one is bigger and has more features):

        Date  Influenza[it]  Febbre[it]  Cefalea[it]  Paracetamolo[it]  \
0    2008-01            989        2395         1291              2933   
1    2008-02            962        2553         1360              2547   
2    2008-03           1029        2309         1401              2735   
3    2008-04           1031        2399         1137              2296   
       ...              ...

     tot_incidence  
0           4.56  
1           5.98  
2           6.54  
3           6.95  
            ....

First of all, I ran an OLS regression on the whole dataframe, without splitting it into training/test sets. This is the 'input configuration' that worked (tot_incidence is the value to predict; Influenza[it], Febbre[it] and Cefalea[it] are the features):

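import statsmodels.formula.api as smf
# rename to short column names that are valid inside a formula
fin1 = fin1.rename(columns={'tot_incidence': 'A', 'Influenza[it]': 'B', 'Febbre[it]': 'C', 'Cefalea[it]': 'D'})
result = smf.ols(formula="A ~ B + C + D", data=fin1).fit()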

OK. Now I want to split the data into a training set and a test set.

I tried a classic split and K-fold.

1) Classic split

This is probably the easier option. I could do this:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.3, random_state=1)

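And then pass the variables to the OLS model:

import statsmodels.api as sm

x_train = sm.add_constant(X_train)  # add the intercept column
model = sm.OLS(y_train, x_train)
results = model.fit()
# the test features need the same intercept column before predicting
predictions = results.predict(sm.add_constant(X_test))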

In this case, how can I build x and y from the dataframe so that I can pass them to the cross_validation.train_test_split function?

2) K-fold (if it's too hard, don't waste time on it)

For example, I could do this:

from sklearn import cross_validation
array = dataframe.values
X = array[:,1:3]
Y = array[:,5]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

At this point I'm stuck: how can I pass these folds to the OLS model and then make predictions? Is there a better way to build the training/test sets?

There is 1 solution below

"In this case, how can I build x and y from the dataframe so that I can pass them to the cross_validation.train_test_split function?"

You need to turn the dataframe columns into inputs (X, y) that the algorithm can work with, i.e. convert the columns into either numbers or categories, depending on the type of algorithm you are using.

1) Select the variable in your dataframe that is your response (the target you want to predict), i.e. your y variable. Say that's Influenza:
y = df['Influenza[it]'].values  # 1-D numpy array; bracket access is needed because of the '[it]' in the name

2) Select the X variables, say Febbre, Cefalea, Paracetamolo:
X = df[['Febbre[it]', 'Cefalea[it]', 'Paracetamolo[it]']].values  # 2-D numpy array, one column per feature

Now you can call the cross_validation.train_test_split function.

Note that if your variables are categorical, you'll have to encode them numerically first, for example with one-hot encoding.