Column name being duplicated in recipe

This is the piece of code I'm having trouble with:

pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

prepared_rec <- prep(pump_recipe)

The error:

Error:
! Column name `funder_W.D...I.` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
  * "funder_W.D...I." at locations 1807 and 1808.
Backtrace:
  1. recipes::prep(pump_recipe)
  2. recipes:::prep.recipe(pump_recipe)
  4. recipes:::bake.step_dummy(x$steps[[i]], new_data = training)
  8. tibble:::as_tibble.data.frame(indicators)
  9. tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
     ...
 16. vctrs `<fn>`()
 17. vctrs:::validate_unique(names = names, arg = arg)
 18. vctrs:::stop_names_must_be_unique(names, arg)
 19. vctrs:::stop_names(...)
 20. vctrs:::stop_vctrs(class = c(class, "vctrs_error_names"), ...)
 Error: 
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.

It seems that the step_dummy() step is creating a duplicated column, and I don't know why. This is the data I'm working with:

https://github.com/norhther/datasets/blob/main/data.csv

Best answer:

Some levels in funder and installer are so similar that step_dummy() generates the same column name for more than one of them. The error says that funder_W.D...I. appears twice.

If we filter the funder column, we see that there are three distinct values that match.

stringr::str_subset(data$funder, "W.D") |> unique()
[1] "W.D.&.I." "W.D & I." "W.D &"   

Neither "W.D.&.I." nor "W.D & I." is a valid column name, so step_dummy() tries to fix them. This yields "funder_W.D...I." for both.
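
The collision can be reproduced outside the recipe. This is a minimal sketch that assumes step_dummy() repairs level names in a way similar to base R's make.names(); I have not traced the exact internal call, so treat it as an illustration rather than a description of the recipes internals. The names levels_raw, cleaned and dummy_cols are only for this example.

```r
# Two of the funder values reported above
levels_raw <- c("W.D.&.I.", "W.D & I.")

# make.names() replaces every character that is not valid in a
# syntactic R name (here "&" and spaces) with "."
cleaned <- make.names(levels_raw)

# Prefix with the variable name, mirroring the name in the error message
dummy_cols <- paste0("funder_", cleaned)

print(dummy_cols)
# TRUE when at least one generated name occurs more than once
print(anyDuplicated(dummy_cols) > 0)
```

Both inputs collapse to the same repaired name, which is the duplicate that the tibble name check rejects.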

You can fix this by adding textrecipes::step_clean_levels() before step_dummy(), which makes sure that the levels of these variables stay valid and non-overlapping.

library(recipes)

pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  textrecipes::step_clean_levels(funder, installer) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

prepared_rec <- prep(pump_recipe)

Note: I would imagine that "W.D.&.I.", "W.D & I." and "W.D &" all refer to the same entity. It is worth checking whether you can collapse these levels manually.
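
A minimal sketch of collapsing the variants by hand, on a toy vector rather than the real data. It assumes the three spellings really are the same funder, which you would need to confirm, and that funder is a character column; the choice of "W.D.&.I." as the canonical spelling and the "Other" value are arbitrary and only for illustration.

```r
# Toy stand-in for data$funder (assumed to be a character vector)
funder <- c("W.D.&.I.", "W.D & I.", "W.D &", "Other")

# Spellings assumed to refer to the same entity
variants <- c("W.D.&.I.", "W.D & I.", "W.D &")
canonical <- "W.D.&.I."  # arbitrary choice of the spelling to keep

# Overwrite every variant with the canonical spelling
funder[funder %in% variants] <- canonical

print(unique(funder))
```

On the real data the same assignment would be applied to data$funder (and to installer if it has the same problem) before building the recipe.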