How does R's iml package handle syntactically invalid factor levels?


I'm using the iml package to derive ALE values from a random forest model trained with caret. In classification tasks where the levels of the dependent variable are not syntactically valid R names, this can cause problems because, under the hood, these levels end up as column names during prediction.
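To make "syntactically invalid" concrete for the levels used in this question, base R's make.names can be applied to them directly. A string is a valid name exactly when make.names returns it unchanged. The helper is_valid_name is introduced here for illustration only:

```r
# Levels used in this question
levels_used <- c("1", "0", "A-1_x", "B-0_y", "1 C-$_3.5")

# Hypothetical helper: a string is a syntactically valid name
# exactly when make.names() returns it unchanged
is_valid_name <- function(x) x == make.names(x)

# Side-by-side view of each level, its validity and its repaired form
data.frame(
  original = levels_used,
  valid    = is_valid_name(levels_used),
  repaired = make.names(levels_used),
  stringsAsFactors = FALSE
)
```

None of the five levels is valid: a leading digit gets an "X" prefix, and characters such as spaces, "-" and "$" are each replaced by ".".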

Here is a toy example that throws an "undefined columns selected" error on the last line of code:

# ----- Packages -----
library(randomForest)
library(caret)
library(iml)

# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)

# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
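The "undefined columns selected" message itself comes from base R's data frame indexing. The sketch below reproduces it without caret or iml: a made-up probability matrix whose column names are class levels is converted once with as.data.frame (names kept) and once with data.frame (names passed through make.names, because check.names = TRUE by default). Whether iml performs such a conversion internally is an assumption, not something confirmed from its source.

```r
# Made-up class probabilities; column names are the original levels
m <- matrix(c(0.2, 0.8,
              0.5, 0.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("A-1_x", "1 C-$_3.5")))

kept    <- as.data.frame(m)  # column names are left untouched
mangled <- data.frame(m)     # check.names = TRUE rewrites the names

names(kept)
names(mangled)

# Selecting by the original level works only where the names survived
kept[, "1 C-$_3.5"]
err <- tryCatch(mangled[, "1 C-$_3.5"], error = function(e) e)
conditionMessage(err)
```

This only shows how a mismatch between the requested class string and the actual column names produces that error. It does not by itself explain why some invalid levels need make.names and others do not.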

I had some examples where a very simple modification did the trick: calling make.names in the second-to-last line of code, like so:

Pred <- Predictor$new(rf, data=df, class=make.names(ALE.ClassOfInterest))

However, in the example above this does not help. The only solution I found is to apply make.names at the very beginning, turning all levels into syntactically valid strings before the model is even trained (see column "Four"). I'd like to keep the original strings for various reasons, though, and I have noticed that other equally invalid levels such as "0" and "1" (see column "One") don't require any workaround. This works:

# ----- Modelling + IML for syntactically invalid levels from "One" -----
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(Two, Three, Four)
rf <- caret::train(TrainData, One, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results

Does anyone know what is happening under the hood, if it is not a plain make.names? Or can anyone suggest a solution that lets me keep the original factor levels in the model?

Thanks, Mark

r2evans On BEST ANSWER

This appears to be a feature/bug that has already been reported to the package author in issue iml/195. I'm not optimistic about a quick fix: the issue was opened in July 2022 (20 months ago as of writing this answer) and has received no comment from the author. (The last change to the package's R functions was in April 2022, so it does not appear to get many updates.)

MarkH On

For the sake of completeness, here is a full example, including the workaround I didn't really want to use. It shows that:

for syntactically invalid levels like "0", make.names within iml's Predictor$new is not required and would actually cause an error; the level just works as if it were syntactically valid

for syntactically invalid levels like "ABC01-01_02::XYZ02-01_2", make.names within iml's Predictor$new is a valid workaround

for syntactically invalid levels like "1 C-$_3.5", make.names within iml's Predictor$new is not a valid workaround, but passing the level unchanged, as for "0", does not work either

creating syntactically valid levels by applying make.names before training the model works for all three examples above and does not require any special treatment within iml's Predictor$new

# Packages
library(randomForest)
library(caret)
library(iml)

# Syntactically Invalid Levels
I1 <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
I2 <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
I3 <- as.factor(sample(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2"), size = 250, replace = TRUE))
df.invalid <- cbind.data.frame(I1,I2,I3)

# Syntactically Valid Levels
V1 <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
V2 <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
V3 <- as.factor(sample(make.names(c("ABC01-01_02", "XYZ02-01_2", "ABC01-01_02::XYZ02-01_2")), size = 250, replace = TRUE))
df.valid <- cbind.data.frame(V1,V2,V3)


# Using df.invalid trying to apply make.names within iml only

# Classification for "1"
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(I2,I3)
rf <- caret::train(TrainData, I1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1" no make.names is required
Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1" make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results

# Classification for "1 C-$_3.5"
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(I1,I3)
rf <- caret::train(TrainData, I2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "1 C-$_3.5" omitting make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "1 C-$_3.5" using make.names also causes an error
# (wrapped in try() so that the rest of the script still runs)
try({
  Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
  FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
})

# Classification for "ABC01-01_02::XYZ02-01_2"
ALE.ClassOfInterest <- "ABC01-01_02::XYZ02-01_2"
TrainData <- cbind.data.frame(I1,I2)
rf <- caret::train(TrainData, I3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
# For "ABC01-01_02::XYZ02-01_2" omitting make.names causes an error
#Pred <- Predictor$new(rf, data=df.invalid, class=ALE.ClassOfInterest)
#FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results
# For "ABC01-01_02::XYZ02-01_2" make.names avoids the error
Pred <- Predictor$new(rf, data=df.invalid, class=make.names(ALE.ClassOfInterest))
FE1 <- FeatureEffects$new(Pred, features=names(df.invalid), method="ale")$results


# Using df.valid applying make.names before model training

# Classification for make.names("1")
ALE.ClassOfInterest <- make.names("1")
TrainData <- cbind.data.frame(V2,V3)
rf <- caret::train(TrainData, V1, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results

# Classification for make.names("1 C-$_3.5")
ALE.ClassOfInterest <- make.names("1 C-$_3.5")
TrainData <- cbind.data.frame(V1,V3)
rf <- caret::train(TrainData, V2, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results

# Classification for make.names("ABC01-01_02::XYZ02-01_2")
ALE.ClassOfInterest <- make.names("ABC01-01_02::XYZ02-01_2")
TrainData <- cbind.data.frame(V1,V2)
rf <- caret::train(TrainData, V3, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df.valid, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df.valid), method="ale")$results
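If the model has to be trained on valid levels, as in the last section above, the original labels can still be kept alongside in a lookup table and restored wherever results are reported. This is a minimal base R sketch, assuming the mapping produced by make.names is one-to-one (unique = TRUE guards against collisions). The names lookup and to_original are made up for illustration, and nothing here touches caret or iml:

```r
# Original, syntactically invalid class labels
original <- c("A-1_x", "B-0_y", "1 C-$_3.5")

# Lookup table: original label <-> valid label used for modelling
lookup <- data.frame(
  original = original,
  valid    = make.names(original, unique = TRUE),
  stringsAsFactors = FALSE
)

set.seed(42)
y_orig <- factor(sample(original, size = 250, replace = TRUE), levels = original)

# Response with valid levels, to be used for training instead of y_orig
y_valid <- factor(
  lookup$valid[match(as.character(y_orig), lookup$original)],
  levels = lookup$valid
)

# Hypothetical helper: translate valid labels back for reporting
to_original <- function(x) lookup$original[match(x, lookup$valid)]

to_original("X1.C.._3.5")
```

In this sketch, y_valid would take the place of V2 in caret::train, the valid label from lookup would be passed as class to Predictor$new, and to_original would be applied to any class labels shown in tables or plots.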