R: More than 52 levels in a predicting factor, truncated for printout


Hi, I'm a beginner in the R programming language. I wrote some code for a regression tree using the rpart package. In my data, some of my independent variables have more than 100 levels. After running the rpart function I get the following warning message: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very weird way. For example, my tree splits on location, which has around 70 distinct levels, but when the label is displayed in the tree it shows "ZZZZZZZZZZZZZZZZ...........", even though I don't have any location called "ZZZZZZZZ".

Please help me.

Thanks in advance.


There are 1 best solutions below


Many of the functions in R have limits on the number of levels a factor-type variable can have (e.g. older versions of randomForest limit a factor to 32 levels; newer versions allow 53).
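Before changing anything, it can help to see which factor columns are actually over the limit. The following is only a minimal sketch: the data frame `dat` and its columns are made up for illustration, and the limit of 52 is simply the number quoted in the rpart warning.

```r
# Hypothetical data frame standing in for the real data.
dat <- data.frame(
  location = factor(paste0("loc", 1:70)),
  group    = factor(rep(c("a", "b"), 35)),
  y        = seq_len(70)
)

# Count the levels of every factor column.
is_fac <- vapply(dat, is.factor, logical(1))
level_counts <- vapply(dat[is_fac], nlevels, integer(1))
level_counts

# Names of the factor columns that exceed the chosen limit.
max_levels <- 52
too_many <- names(level_counts)[level_counts > max_levels]
too_many
```

Only the columns listed in `too_many` would need the treatment described below.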

One way that I've seen it dealt with, especially in data mining competitions, is to:

1) Determine the maximum number of levels allowed for a given function (call this X).

2) Use table() to count the occurrences of each level of the factor and rank the levels from most to least frequent.

3) Leave the top X - 1 levels of the factor as they are.

4) Collapse all remaining levels (those ranked X or lower) into a single new level that marks them as low-occurrence levels.

Here's an example that's a bit long but hopefully helps:

# Make the example reproducible.
set.seed(1)
# Generate 1000 random integers between 0 and 100.
vars1 <- data.frame(values1 = round(runif(1000) * 100, 0))
# Convert the values to a factor variable.
vars1$values1 <- factor(vars1$values1)
# Show the first 6 rows of the data frame.
head(vars1)
# Show the number of distinct factor levels.
nlevels(vars1$values1)
# Create a table showing how often each level occurs. Naming the argument
# fixes the column name to 'values1' regardless of the R version.
table1 <- as.data.frame(table(values1 = vars1$values1))
# Order the table by descending frequency.
table1 <- table1[order(-table1$Freq), ]
head(table1)
# Assuming we want to use CART (rpart), we choose the top 51
# levels to leave unchanged.
# Get the labels of the 51 most frequently occurring levels.
noChange <- as.character(table1$values1[1:51])
# We use '-1000' as the catch-all level to avoid overlap with other levels
# (e.g. if '52' were actually one of the levels).
# ifelse() checks whether each value is among the top 51 levels. If it is,
# the value is kept as is; if not, it is changed to '-1000'.
# as.character() is needed because ifelse() applied to a factor returns the
# internal integer codes rather than the level labels.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange, as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
nlevels(vars1$newFactor)
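
If several columns need this treatment, the same idea can be wrapped in a small function. This is just a sketch using base R: the name `lump_rare_levels`, its arguments and the "Other" label are my own choices for illustration, not part of rpart or any other package.

```r
# Hypothetical helper: keep the (max_levels - 1) most frequent levels of a
# factor and collapse every other level into one catch-all level.
lump_rare_levels <- function(x, max_levels = 52, other = "Other") {
  x <- as.character(x)
  if (other %in% x) stop("'other' label already occurs in the data")
  freq <- sort(table(x), decreasing = TRUE)
  # Nothing to do if the factor is already within the limit.
  if (length(freq) <= max_levels) return(factor(x))
  keep <- names(freq)[seq_len(max_levels - 1)]
  # Missing values stay missing; rare levels become the catch-all level.
  factor(ifelse(is.na(x) | x %in% keep, x, other))
}

# Made-up factor with 70 levels, where "loc1" occurs once, "loc2" twice, etc.
loc <- factor(rep(paste0("loc", 1:70), times = 1:70))
loc52 <- lump_rare_levels(loc, max_levels = 52)
nlevels(loc52)
table(loc52)["Other"]
```

Stopping when the catch-all label already exists is a deliberate choice here, for the same reason '-1000' was picked above: the new level must not collide with a real one.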

Finally, you may want to consider using truncated (abbreviated) names in rpart, as the tree display gets very busy when there are a large number of variables or when they have long names.
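
One possible way to shorten the labels before fitting is base R's abbreviate(). The sketch below uses a made-up `location` factor, and the minimum length of 6 is an arbitrary choice; it only shows the relabelling, not the rpart call itself.

```r
# Hypothetical factor with long level names.
location <- factor(c("North Riverside District", "South Harbour Quarter",
                     "North Riverside District", "Eastern Industrial Zone"))

# abbreviate() shortens each label while keeping the labels distinct.
short <- location
levels(short) <- unname(abbreviate(levels(location), minlength = 6))
levels(short)

# Keep a lookup table so the short labels can be mapped back later.
lookup <- data.frame(short = levels(short), full = levels(location))
lookup
```

The lookup table matters because the abbreviated labels are what will show up in the plotted tree.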