Issue using "pred_yes" column as the estimate argument to roc_curve()


When I run the code below, roc_curve() produces an incorrect ROC curve.

Prep

The code below should be runnable for anyone using RStudio. The data frame contains characteristics of different employees: performance ratings, sales figures, and whether or not they were promoted.

I am attempting to create a decision tree model that uses all other variables to predict whether an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.

library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)

url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"

salespeople <- read.csv(url)

salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))

Creating testing/training data

Using my own homemade train_test() function just for kicks!

    train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
      # Drop incomplete rows first so the sampled indices match the data being split
      if (na.rm) {
        data <- na.omit(data)
      }
      # Sample row indices for the training set; the remaining rows form the test set
      dt <- sample(x = nrow(data), size = floor(nrow(data) * train.size))
      list(train = data[dt, ], test = data[-dt, ])
    }

    tt_list <- train_test(salespeople)

    sales_train <- tt_list$train

    sales_test <- tt_list$test
    
Creating the decision tree model structure/final model/prediction data frame

    tree <- decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("classification")

    model <- tree %>% fit(promoted ~ ., data = sales_train)

    predictions <- predict(model, sales_test, type = "prob") %>%
      bind_cols(sales_test)

Calculate & Plot the ROC curve

When I use the .pred_yes column as the estimate column, roc_curve() calculates an ROC curve that is the inverse of what I want. It seems to have identified .pred_no as the "real" estimate column.

    roc <- roc_curve(predictions,
                     estimate = .pred_yes,
                     truth = promoted)

    autoplot(roc)

Thoughts

The issue seems to go away when I supply .pred_no as the estimate column to roc_curve().
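One possible way to see why swapping the column appears to help: a score for "yes" evaluated against "no" as the event gives a mirrored curve, while .pred_no evaluated against "no" gives a valid curve for the "no" class. The following is a minimal base R sketch on made-up toy data. The labels, the scores, and the auc() helper are all hypothetical illustrations and are not part of yardstick.

```r
# Toy data (hypothetical): true labels and predicted probabilities of "yes"
truth    <- factor(c("no", "no", "yes", "no", "yes", "yes"))
pred_yes <- c(0.10, 0.30, 0.80, 0.40, 0.90, 0.35)
pred_no  <- 1 - pred_yes

# Hypothetical helper: AUC as the probability that a randomly chosen event
# case scores higher than a randomly chosen non-event case (ties count half)
auc <- function(score, truth, event) {
  pos <- score[truth == event]
  neg <- score[truth != event]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_yes_as_event   <- auc(pred_yes, truth, event = "yes")  # intended pairing
auc_mirrored       <- auc(pred_yes, truth, event = "no")   # score and event disagree
auc_no_as_event    <- auc(pred_no,  truth, event = "no")   # consistent pairing for "no"

# The mismatched pairing is the complement of the intended one
all.equal(auc_mirrored, 1 - auc_yes_as_event)
# Pairing .pred_no with "no" recovers the same area as .pred_yes with "yes"
all.equal(auc_no_as_event, auc_yes_as_event)
```

Under these assumptions, the inverted curve points to a mismatch between the score column and the level treated as the event, rather than to a problem with the model itself.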

FYI: this is my first Stack Overflow post. If you have any suggestions to make this post clearer or better formatted, please let me know!

Answer

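In factor(c("yes", "no")), the levels are sorted alphabetically, so "no" is the first level, and the first level is the one that yardstick (the tidymodels metrics package) treats as the event of interest by default. You can change the level of interest with the event_level argument, as documented in the yardstick reference for roc_curve():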

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>% 
    mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)

tree <- decision_tree() %>%
    set_engine("rpart") %>%
    set_mode("classification") 


tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#>    promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#>    <fct>    <dbl>         <dbl>       <dbl> <fct>          <dbl>     <dbl>
#>  1 no         364          4.89           1 no             0.973    0.0267
#>  2 no         342          3.74           3 no             0.973    0.0267
#>  3 yes        716          3.16           3 yes            0        1     
#>  4 no         450          3.21           3 no             0.973    0.0267
#>  5 no         372          3.87           3 no             0.973    0.0267
#>  6 no         535          4.47           2 no             0.973    0.0267
#>  7 yes        736          3.94           4 yes            0        1     
#>  8 no         330          2.54           2 no             0.973    0.0267
#>  9 no         478          3.48           2 no             0.973    0.0267
#> 10 yes        728          2.66           3 yes            0        1     
#> # … with 78 more rows

sales_preds %>%
    roc_curve(promoted, .pred_yes, event_level = "second") %>%
    autoplot()

Created on 2021-09-08 by the reprex package (v2.0.1)
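An alternative that may be worth considering is to set the factor levels explicitly when creating promoted, so that "yes" is the first level and the default event_level = "first" applies. The sketch below uses base R only, and promoted_raw is a hypothetical stand-in for the 0/1 promoted column in the CSV.

```r
# Hypothetical stand-in for the 0/1 `promoted` column
promoted_raw <- c(1, 0, 0, 1, 0)

# Default behaviour: levels are sorted alphabetically, so "no" comes first
default_fct <- factor(ifelse(promoted_raw == 1, "yes", "no"))
levels(default_fct)

# Explicit levels put "yes" first, matching the default event_level = "first"
yes_first_fct <- factor(ifelse(promoted_raw == 1, "yes", "no"),
                        levels = c("yes", "no"))
levels(yes_first_fct)
```

With the levels ordered this way before fitting, roc_curve(promoted, .pred_yes) should treat "yes" as the event without needing event_level = "second".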