I have two separate data sets: a train set (1,000,000 observations) and a test set (1,000,000 observations). I split the train set into three subsets: mytrain (700,000 observations), myvalid (150,000 observations), and mytest (150,000 observations). The test set with 1,000,000 observations doesn't include the target variable, so it should be used only for the final test. Since some categorical variables have missing values, I need to use mice to impute them. I want to reuse the imputation model fitted on mytrain to fill in the missing values in the myvalid, mytest, and test sets. Based on the answer to this question, I should do this:
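library(mice)
# Stack all four sets so that a single mice() call covers every row
data2 <- rbind(mytrain, myvalid, mytest, test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)
# Rows flagged TRUE in `ignore` are imputed but not used to fit the
# imputation model, so only the first 700000 rows (mytrain) drive it
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
            ignore = c(rep(FALSE, 700000), rep(TRUE, 1300000)))
data2.imp <- complete(imp, 1)
summary(imp)
# Split the imputed data back into the original sets by row position
mytrainN <- data2.imp[1:700000, ]
myvalN <- data2.imp[700001:850000, ]
mytestN <- data2.imp[850001:1000000, ]
testN <- data2.imp[1000001:2000000, ]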
However, since the test set does not have the target column, it cannot be combined with mytrain, mytest, and myvalid using rbind. Is it possible to add a hypothetical placeholder target column (with a value of, say, 10 for all 1,000,000 observations) to the test set?
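
To make the idea concrete, here is a minimal sketch on toy data of what I have in mind. The column names x and target, the tiny data frames, and the helper names ignore_rows and train_idx are made up for illustration; the placeholder value 10 is the one proposed above. It only shows the mechanics of aligning the columns and keeping track of the rows, not whether the placeholder is statistically harmless.

```r
# Toy stand-ins for the real data (hypothetical column names and sizes)
mytrain <- data.frame(x = c(1, 2, NA), target = c(0, 1, 0))
test    <- data.frame(x = c(NA, 5))

# Add the placeholder target so both data frames have the same columns
test$target <- 10

# rbind() now works because the column sets match
data2 <- rbind(mytrain, test)

# Logical vector that would be passed as the `ignore` argument of mice():
# FALSE for mytrain rows, TRUE for the appended test rows
ignore_rows <- c(rep(FALSE, nrow(mytrain)), rep(TRUE, nrow(test)))

# After imputation the sets would be split back by row position,
# and the placeholder column dropped from the test part again
train_idx <- seq_len(nrow(mytrain))
mytrainN  <- data2[train_idx, ]
testN     <- data2[-train_idx, setdiff(names(data2), "target"), drop = FALSE]
```

If the target column is used as a predictor when imputing the other variables, the constant 10 in the test rows would presumably feed into their imputed values, which is part of what I am unsure about.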