How to keep all levels of categorical variables when splitting a data frame into test and train sets in R


Sometimes, when splitting a data frame with categorical columns into train and test sets, the train set does not contain every level of a categorical variable. When you then fit a model on the train set and try to predict on the test set, the prediction fails with a "factor has new levels" error.

For example:

x <- data.frame(...) # data frame with columns with very dispersed categorical variables
set.seed(123)
smp_size <- floor(0.75 * nrow(x))
train_idx <- sample(seq_len(nrow(x)), size = smp_size)
train_set <- x[train_idx, ]
test_set <- x[-train_idx, ]
m <- lm(some_formula, data=train_set)
predict(m, newdata=test_set)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :  
    factor xxxx has new levels yyy ...
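
One way to confirm which levels trigger this error is to compare the levels actually observed in each subset. The following is a minimal sketch on made-up data; the data frame and the column name `grp` are hypothetical and only stand in for the real columns:

```r
# Toy data: 'grp' is a hypothetical factor column with one rare level ("c")
x <- data.frame(
  y   = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.4, 8.9),
  grp = factor(c("a", "a", "a", "b", "b", "b", "b", "c"))
)

# Deterministic split so that the rare level "c" lands in the test set only
train_set <- x[1:6, ]
test_set  <- x[7:8, ]

# Levels actually observed in each subset (unused levels dropped)
train_levels <- levels(droplevels(train_set$grp))
test_levels  <- levels(droplevels(test_set$grp))

# Levels the model would never see during training
missing_in_train <- setdiff(test_levels, train_levels)
print(missing_in_train)
```

Any level listed in `missing_in_train` is one that predict() would report as a new level.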

Does anyone know a handy way to set the levels of all categorical variables in both the train and test sets to the levels in the original data set?

Thank you.

Best answer

The caret function createDataPartition() attempts to deal with the issue you describe.

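Given your example above, you should be able to use it this way, where y is the factor column to stratify on (for example, one of the categorical columns of x):

library(caret)
train_idx <- createDataPartition(y, times = 1, p = 0.75, list = FALSE)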

Here is part of the R documentation for createDataPartition(): "the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits."
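
The idea in the quoted documentation, sampling within each level, can also be sketched in base R without caret. This is only an illustration of the technique, not caret's actual implementation; the helper name `stratified_train_idx`, the toy data and the column name `grp` are all hypothetical. The sketch takes at least one row per level, so every level that occurs in the data also occurs in the train set:

```r
# Hypothetical helper: sample a fraction p of the row indices within each
# level of the factor f, always taking at least one row per level
stratified_train_idx <- function(f, p = 0.75) {
  idx_by_level <- split(seq_along(f), f, drop = TRUE)
  picked <- lapply(idx_by_level, function(idx) {
    n_take <- max(1, floor(length(idx) * p))
    # Index into idx explicitly: sample(idx, 1) would misbehave
    # when idx holds a single number
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Toy data with unbalanced levels, including a level with a single row
x <- data.frame(
  y   = rnorm(20),
  grp = factor(rep(c("a", "b", "c", "d"), times = c(10, 6, 3, 1)))
)

train_idx <- stratified_train_idx(x$grp, p = 0.75)
train_set <- x[train_idx, ]
test_set  <- x[-train_idx, ]

# Every level was seen during training, so predict() finds no new levels
m <- lm(y ~ grp, data = train_set)
pred <- predict(m, newdata = test_set)
```

Note that this stratifies on a single factor; with several sparse categorical columns, each one would need the same guarantee. As far as I know, resetting levels() on the subsets alone would not avoid the error with lm(), because lm() drops unused factor levels when it builds its model frame, so each level has to actually occur in the training rows.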