I have a dataframe like this (the real one is bigger and has more features):
      Date  Influenza[it]  Febbre[it]  Cefalea[it]  Paracetamolo[it]  tot_incidence
0  2008-01            989        2395         1291              2933           4.56
1  2008-02            962        2553         1360              2547           5.98
2  2008-03           1029        2309         1401              2735           6.54
3  2008-04           1031        2399         1137              2296           6.95
...
First of all, I ran an OLS regression on the whole dataframe, without splitting it into training/test sets. This is the 'input configuration' that worked (tot_incidence is the variable to predict; Influenza[it], Febbre[it] and Cefalea[it] are the features):
fin1=fin1.rename(columns = {'tot_incidence':'A','Influenza[it]':'B', 'Febbre[it]':'C','Cefalea[it]':'D'})
result = sm.ols(formula="A ~ B + C + D", data=fin1).fit()
OK. Now I want to make a training set and a test set.
I tried a classic split and k-fold.
1° Classic split
This is probably the easier option. I could do this:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.3, random_state=1)
And then insert the variables in the OLS model:
x_train = sm.add_constant(X_train)
model = sm.OLS(y_train, x_train)
results = model.fit()
predictions = results.predict(sm.add_constant(X_test))  # the test features need the same constant column as the training ones
In this case, how can I build x and y from the dataframe so that I can pass them to the cross_validation.train_test_split function?
2° K-fold (if it is too hard, don't waste time on it)
For example, I could do this:
from sklearn import cross_validation
array = dataframe.values
X = array[:,1:3]
Y = array[:,5]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
At this point I'm stuck: how can I feed these folds into the OLS model and then make the predictions? Is there a better way to build the training/test sets?
You need to convert the dataframe columns into inputs (x, y) that the algorithm can understand, i.e. convert them into either numbers or categories, depending on the type of algorithm you are using.
1) Select the variable in your dataframe that is your response, i.e. your Y variable (the one you want to predict). Say that's Influenza:
y = df.Influenza.values # convert to a numpy array
2) Select the X variables, say Febbre, Cefalea, Paracetamolo:
X = np.column_stack([df.Febbre.values, df.Cefalea.values, df.Paracetamolo.values])
Attribute access such as df.Febbre only works when the column names are plain identifiers; for names like Febbre[it], use df['Febbre[it]'] instead.
Now you can call the cross_validation.train_test_split function.
Note that if your variables are categories, then you'll have to use some sort of encoding, such as one-hot.