Tidymodels Workflow working with add_formula() or add_variables() but not with add_recipe()


I ran into some odd behavior when using a recipe and a workflow to discriminate spam from valid texts with a naive Bayes classifier. I was trying to replicate, with tidymodels and a workflow, the results of the 4th chapter of the book Machine Learning with R: https://github.com/PacktPublishing/Machine-Learning-with-R-Second-Edition/blob/master/Chapter%2004/MLwR_v2_04.r

While I was able to reproduce the analysis either with add_variables() or add_formula() or with no workflow, the workflow using the add_recipe() function did not work.

library(RCurl)
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(tm)
library(SnowballC) 
library(discrim) 


sms_raw <- getURL("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/sms_spam.csv")
sms_raw <- read_csv(sms_raw)
sms_raw$type <- factor(sms_raw$type)

set.seed(123)
split <- initial_split(sms_raw, prop = 0.8, strata = "type")
nb_train_sms <- training(split)
nb_test_sms <- testing(split)

# Text preprocessing
reci_sms <- 
  recipe(type ~.,
         data = nb_train_sms) %>% 
  step_mutate(text = str_to_lower(text)) %>% 
  step_mutate(text = removeNumbers(text)) %>% 
  step_mutate(text = removePunctuation(text)) %>% 
  step_tokenize(text) %>% 
  step_stopwords(text, custom_stopword_source = stopwords()) %>% 
  step_stem(text) %>% 
  step_tokenfilter(text, min_times = 6, max_tokens = 1500) %>% 
  step_tf(text, weight_scheme = "binary") %>% 
  step_mutate_at(contains("tf"), fn = function(x) {ifelse(x == TRUE, "Yes", "No")}) %>% 
  prep()


df_training <- juice(reci_sms)
df_testing <- bake(reci_sms, new_data = nb_test_sms)

nb_model <- naive_Bayes() %>% 
  set_engine("klaR") 

Here are three code examples that do produce valid output:

# --------- works but slow -----
nb_fit <- workflow() %>%
  add_model(nb_model) %>%
  add_formula(type~.) %>%
  fit(df_training)
nb_tidy_pred <- nb_fit %>% predict(df_testing)


# --------- works  -----
nb_fit <- nb_model %>% fit(type ~., df_training)
nb_tidy_pred <- nb_fit %>% predict(df_testing)


# --------- works  -----

nb_fit <- workflow() %>%
  add_model(nb_model) %>%
  add_variables(outcomes = type, predictors = everything()) %>%
  fit(df_training)

nb_tidy_pred <- nb_fit %>% predict(df_testing)

The following code, however, does not work:

nb_fit <- workflow() %>%
  add_model(nb_model) %>%
  add_recipe(reci_sms) %>%
  fit(data = df_training)

nb_tidy_pred <- nb_fit %>% predict(df_testing)

It throws the following error, and running rlang::last_error() did not help me understand what is going on:

Not all variables in the recipe are present in the supplied training set: 'text'.
Run `rlang::last_error()` to see where the error occurred.

Can someone tell me what I am missing?

Best answer

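When you use a recipe in a workflow, you combine the preprocessing steps with the model fitting. So when you fit that workflow, you need to pass the data that the recipe expects (the raw nb_train_sms, which still has the text column), not the data that the parsnip model expects (the already-preprocessed df_training, where text has been replaced by term-frequency columns). That is why the error complains that 'text' is missing.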
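To see why the error names 'text', here is a minimal sketch on a made-up two-row tibble. The names toy_sms, toy_rec and toy_baked are hypothetical and only stand in for the SMS data; the recipe is cut down to the two steps that matter here.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data standing in for the SMS set
toy_sms <- tibble(
  type = factor(c("ham", "spam")),
  text = c("see you at lunch", "win a free prize now")
)

# Reduced version of the recipe: tokenize, then binary term frequencies
toy_rec <- recipe(type ~ ., data = toy_sms) %>%
  step_tokenize(text) %>%
  step_tf(text, weight_scheme = "binary")

# What the model sees after preprocessing (the equivalent of juice())
toy_baked <- bake(prep(toy_rec), new_data = NULL)

"text" %in% names(toy_sms)    # the recipe's input has a `text` column
"text" %in% names(toy_baked)  # the preprocessed output no longer does
names(toy_baked)              # instead it has one tf_text_* column per token
```

Feeding toy_baked back into a workflow that contains toy_rec would ask the recipe to find a column it has already consumed, which is the situation in the question.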

Furthermore, it is not recommended to pass a prepped recipe to a workflow, because the workflow estimates the recipe itself when you call fit(). Note that the code below does not call prep() before adding the recipe with add_recipe().

library(RCurl)
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(tm) 
library(discrim)

sms_raw <- getURL("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/sms_spam.csv")
sms_raw <- read_csv(sms_raw)
sms_raw$type <- factor(sms_raw$type)

set.seed(123)
split <- initial_split(sms_raw, prop = 0.8, strata = "type")
nb_train_sms <- training(split)
nb_test_sms <- testing(split)

# Text preprocessing
reci_sms <- 
  recipe(type ~.,
         data = nb_train_sms) %>% 
  step_mutate(text = str_to_lower(text)) %>% 
  step_mutate(text = removeNumbers(text)) %>% 
  step_mutate(text = removePunctuation(text)) %>% 
  step_tokenize(text) %>% 
  step_stopwords(text, custom_stopword_source = stopwords()) %>% 
  step_stem(text) %>% 
  step_tokenfilter(text, min_times = 6, max_tokens = 1500) %>% 
  step_tf(text, weight_scheme = "binary")  %>% 
  step_mutate_at(contains("tf"), fn = function(x){ifelse(x == TRUE, "Yes", "No")})

nb_model <- naive_Bayes() %>% 
  set_engine("klaR") 

nb_fit <- workflow() %>%
  add_model(nb_model) %>%
  add_recipe(reci_sms) %>%
  fit(data = nb_train_sms)
#> Warning: max_features was set to '1500', but only 1141 was available and
#> selected.

nb_tidy_pred <- nb_fit %>% predict(nb_train_sms)

Created on 2021-04-19 by the reprex package (v1.0.0)
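The last line above predicts on the training data. To score held-out rows, the same rule applies: pass the raw, unbaked test set to predict() and let the workflow apply the recipe. The following is only a sketch under stated assumptions: it uses the two_class_dat data set from the modeldata package as a stand-in for the SMS data, and swaps in logistic_reg() with the "glm" engine for the naive Bayes model so that it runs without klaR. All toy_* names are hypothetical.

```r
library(tidymodels)

# Stand-in data (assumption: not the SMS data from the question)
data(two_class_dat, package = "modeldata")

set.seed(123)
toy_split <- initial_split(two_class_dat, prop = 0.8, strata = Class)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Unprepped recipe: the workflow estimates it during fit()
toy_rec <- recipe(Class ~ ., data = toy_train) %>%
  step_normalize(all_numeric_predictors())

toy_fit <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(toy_rec) %>%
  fit(data = toy_train)            # raw training data

# Raw test data goes in; the workflow bakes it before predicting
toy_pred <- predict(toy_fit, new_data = toy_test) %>%
  bind_cols(toy_test %>% select(Class))

accuracy(toy_pred, truth = Class, estimate = .pred_class)
```

With the objects from the answer, the corresponding call would be predict(nb_fit, nb_test_sms).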