How to keep all levels of categorical variables when splitting a data frame into test and train sets in R

2k Views Asked by botkop At 17 August 2025 at 16:51

Sometimes, when splitting a data frame with categorical columns into a test and a train set, the train set does not contain all levels of a categorical variable. When you then train the model and try to predict on the test set, the prediction fails with a "factor has new levels" error.

For example:

x <- data.frame(...) # data frame with columns with very dispersed categorical variables
set.seed(123)
smp_size <- floor(0.75 * nrow(x))
train_idx <- sample(seq_len(nrow(x)), size = smp_size)
train_set <- x[train_idx, ]
test_set <- x[-train_idx, ]
m <- lm(some_formula, data=train_set)
predict(m, newdata=test_set)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :  
    factor xxxx has new levels yyy ...

Does anyone know a handy way to set the levels of all categorical variables in both the train and the test set to the levels in the original data set?

Thank you.

adpap On 14 November 2014 at 16:05 BEST ANSWER

The createDataPartition() function from the caret package attempts to deal with the issue you describe by sampling within the levels of a factor.

Given your example above, you should be able to use it this way, where y is the categorical column you want to stratify on:

library(caret)
train_idx <- createDataPartition(y, times = 1, p = 0.75, list = FALSE)
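
As a fuller illustration, here is a minimal sketch of a stratified split on made-up data. The data frame and the column names grp and val are hypothetical, and the sketch assumes the caret package is installed. It only shows the pattern; it is not a guarantee for every data set.

```r
library(caret)  # assumption: the caret package is installed

set.seed(123)

# Hypothetical data: 'grp' is a factor whose level "d" is rare
x <- data.frame(
  grp = factor(c(rep("a", 40), rep("b", 40), rep("c", 16), rep("d", 4))),
  val = rnorm(100)
)

# Stratify on grp: rows are sampled within each level of the factor
train_idx <- createDataPartition(x$grp, times = 1, p = 0.75, list = FALSE)
train_set <- x[train_idx, ]
test_set  <- x[-train_idx, ]

# Levels that occur in the test set but not in the train set
missing_levels <- setdiff(as.character(test_set$grp),
                          as.character(train_set$grp))
missing_levels
```

If missing_levels is empty, every level that occurs in the test set was also seen during training, so predict() should not complain about new levels for that column.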

Here is a part of the R documentation on the function createDataPartition: "the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits."
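
Note that this stratification applies to the single variable y, so with several categorical predictors a level of some other column can still end up only in the test set. One possible workaround, sketched below in base R, is to force one row per observed level of every categorical column into the train set and fill the rest at random. The helper split_keep_levels() and the example columns are hypothetical names made up for this illustration; they are not part of base R or caret.

```r
# Hypothetical helper: returns row indices for a train set that contains
# at least one row for every observed level of every categorical column.
split_keep_levels <- function(df, p = 0.75) {
  n <- nrow(df)

  # Treat factor and character columns as categorical
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col),
                   logical(1))

  # For each categorical column, pick one random row index per level
  forced <- unlist(lapply(df[is_cat], function(col) {
    tapply(seq_len(n), as.character(col),
           function(idx) idx[sample.int(length(idx), 1)])
  }), use.names = FALSE)
  forced <- unique(forced)

  # Fill the remainder of the train set with randomly chosen other rows
  n_train <- max(floor(p * n), length(forced))
  rest    <- setdiff(seq_len(n), forced)
  extra   <- rest[sample.int(length(rest), n_train - length(forced))]

  sort(c(forced, extra))
}

set.seed(123)

# Made-up data: city "D" occurs exactly once
x <- data.frame(
  city  = c(rep("A", 50), rep("B", 45), rep("C", 4), "D"),
  color = factor(rep(c("red", "green", "blue", "red", "green"), 20)),
  y     = rnorm(100)
)

train_idx <- split_keep_levels(x, p = 0.75)
train_set <- x[train_idx, ]
test_set  <- x[-train_idx, ]

m     <- lm(y ~ city + color, data = train_set)
preds <- predict(m, newdata = test_set)
```

The trade-off is that the split is no longer purely random: a level with a single observation always lands in the train set and is never evaluated on the test set.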
