Overcoming compatibility issues with using iml from h2o models

Question

Asked by user1809593 on 11 November 2021 at 14:43 (260 views)

I am unable to reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/); the failure is detailed in "Error when extracting variable importance with FeatureImp$new and H2O". Can anyone point to a workaround or to other examples of using iml with h2o?

Reproducible example:

library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation
library(modeldata) # attrition data


# initialize h2o session
h2o.no_progress()
h2o.init()

# classification data
data("attrition", package = "modeldata")
df <- attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = 
    c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
  )

# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata)  {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")

Error obtained:

Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected

traceback()

1. FeatureImp$new(predictor.glm, loss = "mse")

2. .subset2(public_bind_env, "initialize")(...)

3. private$run.prediction(private$sampler$X)

4. self$predictor$predict(data.frame(dataDesign))

5. prediction[, self$class, drop = FALSE]

6. `[.data.frame`(prediction, , self$class, drop = FALSE)

7. stop("undefined columns selected")

**TheFon** · Answer 1 · 2022-02-17T06:45:55.373000

The iml package documentation describes the class argument as "The class column to be returned". When you set class = "classification", iml looks for a prediction column named "classification", which does not exist; that is the prediction[, self$class, drop = FALSE] call in your traceback. Judging by its GitHub history, iml has gone through a fair amount of development since that blog post, so some of the blog's code may no longer be compatible with current releases.
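The failing step can be reproduced in base R without h2o or iml. This is only a sketch: `prediction` and `class_arg` are hypothetical stand-ins for the data frame of predictions and for self$class, and the column name "pred" is an assumption made for illustration.

```r
# Hypothetical stand-in for the data frame of predictions that iml works with.
prediction <- data.frame(pred = c(0.12, 0.87, 0.40))

# Hypothetical stand-in for self$class as set in the question.
class_arg <- "classification"

# Selecting a column that does not exist raises the error from the traceback.
err <- tryCatch(
  prediction[, class_arg, drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(err)

# Selecting a column that does exist works as expected.
ok <- prediction[, "pred", drop = FALSE]
print(names(ok))
```

So whatever is passed as class has to match a column name in the predictions, not the kind of task.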

After reading through the package documentation, I think you might want to try something like:

predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response,   # features has no Attrition column, so pass the response vector
  predict.function = pred,
  type = "prob"
  )

# check ability to predict first
check <- predictor.glm$predict(features)
print(check)
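Putting the pieces together, one possible end-to-end version is sketched below. It is not a verified fix and rests on several assumptions: the original "Yes"/"No" labels are kept, so H2O is assumed to name its probability columns "No" and "Yes"; the predict function returns those named columns so that class = "Yes" can select one; and the argument is assumed to be called predict.function, as in the snippet above. Permutation importance converts data to H2O frames repeatedly, so the last step can be slow.

```r
library(h2o)        # model training and prediction
library(iml)        # model-agnostic interpretation
library(modeldata)  # attrition data

h2o.no_progress()
h2o.init()

# Load the data and turn ordered factors into plain factors, as in the question.
data("attrition", package = "modeldata")
df <- attrition
ordered_cols <- vapply(df, is.ordered, logical(1))
df[ordered_cols] <- lapply(df[ordered_cols], factor, ordered = FALSE)

# Train/validation/test split on the H2O side.
df.h2o <- as.h2o(df)
splits <- h2o.splitFrame(df.h2o, ratios = c(0.7, 0.15), seed = 123)
train <- splits[[1]]
valid <- splits[[2]]

y <- "Attrition"
x <- setdiff(names(df), y)

glm <- h2o.glm(
  x = x, y = y,
  training_frame = train, validation_frame = valid,
  family = "binomial", seed = 123
)

# Features as a plain data frame, response as 0/1 (1 = "Yes").
valid_df <- as.data.frame(valid)
features <- valid_df[, x, drop = FALSE]
response <- as.numeric(valid_df[[y]] == "Yes")

# Return the class-probability columns by name (assumed: "No" and "Yes"),
# dropping the first column, which holds the predicted label.
pred <- function(model, newdata) {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  results[, -1, drop = FALSE]
}

predictor.glm <- Predictor$new(
  model = glm,
  data = features,
  y = response,
  predict.function = pred,
  class = "Yes"   # must match a column name in the predictions
)

# Check that prediction works before computing importance.
check <- predictor.glm$predict(head(features))
print(check)

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")
print(head(imp.glm$results))
```

If your labels are recoded to "1"/"0" as in the question, inspect names(as.data.frame(h2o.predict(glm, valid))) first and pass the matching probability column name as class.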

Even better might be to leverage H2O's extensive functionality around machine learning interpretability.

h2o.varimp(glm) will give the user the variable importance for each feature

h2o.varimp_plot(glm, 10) will render a graphic showing the relative importance of each feature.

h2o.explain(glm, as.h2o(features)) is a wrapper for the explainability interface and will by default provide the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.

For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() will provide the shap contributions.

The h2o-3 docs are a good place to explore these functions further.
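The H2O functions mentioned above could be tried along these lines. This is a minimal sketch under some assumptions: a recent h2o release that includes the explainability functions, ggplot2 installed for the plots, and a held-out frame that still contains the response column. The GBM is added only to illustrate the SHAP plots for a tree-based model; it is not part of the question.

```r
library(h2o)        # modeling and built-in explainability
library(modeldata)  # attrition data

h2o.no_progress()
h2o.init()

# Same data preparation as in the question, keeping the original labels.
data("attrition", package = "modeldata")
df <- attrition
ordered_cols <- vapply(df, is.ordered, logical(1))
df[ordered_cols] <- lapply(df[ordered_cols], factor, ordered = FALSE)

splits <- h2o.splitFrame(as.h2o(df), ratios = 0.8, seed = 123)
train <- splits[[1]]
test <- splits[[2]]

y <- "Attrition"
x <- setdiff(names(df), y)

glm <- h2o.glm(x = x, y = y, training_frame = train,
               family = "binomial", seed = 123)

# Variable importance as a table and as a plot of the top 10 features.
vi <- h2o.varimp(glm)
print(head(vi))
h2o.varimp_plot(glm, num_of_features = 10)

# Explainability wrapper; the frame passed here includes the response.
explanation <- h2o.explain(glm, test)
print(explanation)

# SHAP plots need a tree-based model, so fit a GBM for illustration.
gbm <- h2o.gbm(x = x, y = y, training_frame = train, seed = 123)
print(h2o.shap_summary_plot(gbm, test))
print(h2o.shap_explain_row_plot(gbm, test, row_index = 1))
```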
