Hi, I'm a beginner in R. I wrote some code for a regression tree using the rpart package. Some of the independent variables in my data have more than 100 levels. After running the rpart function I get the following warning: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very strange way. For example, the tree splits on location, which has around 70 distinct levels, but the label shown in the tree is "ZZZZZZZZZZZZZZZZ...........", even though I don't have any location called "ZZZZZZZZ".
Please help me.
Thanks in advance.
Many functions in R limit the number of levels a factor variable can have (e.g. `randomForest` limits a factor to 32 levels). One way I've seen this dealt with, especially in data mining competitions, is to:
1) Determine the maximum number of levels allowed for a given function (call this `X`).
2) Use `table()` to count the occurrences of each level of the factor and rank the levels from most to least frequent.
3) Leave the top `X - 1` levels of the factor as they are.
4) Change all the remaining levels (those ranked `X` or lower) to a single level that marks them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
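One way the four steps might look in base R. This is only a sketch under assumptions: the helper name `lump_levels`, the catch-all label "Other", and the simulated location data are all made up for illustration, and the limit of 32 is just the figure mentioned above.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels of a
# factor and merge every other level into a single catch-all level.
lump_levels <- function(f, max_levels, other = "Other") {
  f <- as.factor(f)
  if (nlevels(f) <= max_levels) {
    return(f)  # already within the limit, nothing to do
  }
  # Step 2: count occurrences of each level, most frequent first
  counts <- sort(table(f), decreasing = TRUE)
  # Step 3: the top (max_levels - 1) levels are kept as they are
  keep <- names(counts)[seq_len(max_levels - 1)]
  # Step 4: every remaining level is relabelled as the catch-all level
  x <- as.character(f)
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

# Simulated data: up to 70 made-up location codes with uneven frequencies
set.seed(1)
locs <- sprintf("loc%02d", 1:70)
location <- factor(sample(locs, 2000, replace = TRUE, prob = 70:1))

# Step 1: X is the limit of the function you plan to use (32 assumed here)
location_small <- lump_levels(location, max_levels = 32)
nlevels(location_small)
```

The reduced factor can then replace the original column before calling the modelling function. Whether merging the rare levels is acceptable depends on the data, since any signal carried by those levels is pooled together.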
Finally, you may want to consider using truncated (shortened) variable names in `rpart`, as the tree display gets very busy when there are many variables or they have long names.
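As a rough illustration of shortening names before fitting, base R's `abbreviate()` can be applied to factor levels (the same idea works for column names). The location names below are made up for the example, and the target length of 6 is an arbitrary choice.

```r
# Made-up long location names standing in for real factor levels
location <- factor(c("North Riverside District", "South Harbour Terminal",
                     "North Riverside District", "Central Market Square"))

# abbreviate() is base R; minlength is the target length of each short name
short_names <- unname(abbreviate(levels(location), minlength = 6))

# Replace the long level names with the shortened ones
levels(location) <- short_names

levels(location)
```

It is worth checking the shortened names by eye before plotting, since heavily abbreviated labels can be hard to map back to the original values.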