I have a strange problem with ctree in for loops. If I run this code inside a loop, R freezes.
data = read.csv("train.csv") #data description https://www.kaggle.com/c/titanic-gettingStarted/data
treet = ctree(Survived ~ ., data = data)
print(plot(treet))
Sometimes I get an error: "More than 52 levels in a predicting factor, truncated for printout" and my tree is displayed in a very weird way. Sometimes it works just fine. Really, really strange!
My loop code:
functionPlot <- function(traine, i) {
  print(i) # print only once, then RStudio freezes
  tempd <- ctree(Survived ~ ., data = traine)
  print(plot(tempd))
}
for(i in 1:2) {
  smp_size <- floor(0.70 * nrow(data))
  train_ind <- sample(seq_len(nrow(data)), size = smp_size)
  set.seed(100 + i)
  train <- data[train_ind, ]
  test <- data[-train_ind, ]
  #
  functionPlot(train, i)
}
The ctree() function expects that (a) appropriate classes (numeric, factor, etc.) are used for each variable, and that (b) only useful predictors are employed in the model formula.
As for (a), you have supplied variables that are really just characters (like Name) and not factors. These would either need to be pre-processed appropriately or omitted from the analysis.
Even apart from that, you will not get the best results because some variables (like Survived and Pclass) are coded numerically but are really categorical variables that should be factors. If you look at the scripts from https://www.kaggle.com/c/titanic/forums/t/13390/introducing-kaggle-scripts then you will also see how the data preparation can be carried out. Here, I recode them as factors.
As for (b), I then go on to call ctree() with only the variables which have been sufficiently pre-processed for meaningful analysis. (And I use the newer recommended implementation from package partykit.)
This yields the following graphical output:
And the following print output:
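The preparation and model call described in this answer might look roughly like the sketch below. It is an illustration under assumptions, not the exact code behind the output: the helper name prepare_titanic, the object titanic_formula, and the particular set of predictors are hypothetical choices, and the column names are taken from the Kaggle train.csv file. The model is only fitted if partykit is installed and train.csv is in the working directory.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# coerce the Titanic columns to the classes that ctree() expects.
prepare_titanic <- function(df) {
  # Survived and Pclass are coded as numbers but are really categorical.
  df$Survived <- factor(df$Survived, levels = 0:1, labels = c("no", "yes"))
  df$Pclass   <- factor(df$Pclass)
  df$Sex      <- factor(df$Sex)
  df$Embarked <- factor(df$Embarked)
  # Name is free text, not a meaningful predictor; keep it as character
  # so it cannot turn into a factor with hundreds of levels.
  df$Name     <- as.character(df$Name)
  df
}

# Assumed predictor set: only numeric columns and factors with few levels.
# Name, Ticket and Cabin are deliberately left out of the formula.
titanic_formula <- Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

# Fit and display the tree only when the package and the data are available.
if (requireNamespace("partykit", quietly = TRUE) && file.exists("train.csv")) {
  titanic <- prepare_titanic(read.csv("train.csv"))
  ct <- partykit::ctree(titanic_formula, data = titanic)
  plot(ct)   # plot() draws the tree itself; wrapping it in print() is not needed
  print(ct)
}
```

Listing the predictors explicitly instead of using Survived ~ . keeps high-cardinality columns such as Name out of the model, which is the likely source of the "More than 52 levels" message.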