Issue using "pred_yes" column as the estimate argument to roc_curve()


When I run the code below, roc_curve() produces an ROC curve that is the inverse of what I expect.

Prep

The code below should be runnable by anyone using RStudio. The data frame contains characteristics of different employees: performance ratings, sales figures, and whether or not they were promoted.

I am attempting to create a decision tree model that uses all other variables to predict whether an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.

library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)

url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"

salespeople <- read.csv(url)

salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))

Creating testing/training data

Using my own homemade train_test() function just for kicks!

    train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
      # Drop incomplete rows first so the sampled indices match the data being split
      if (na.rm) {
        data <- na.omit(data)
      }
      # Sample row indices for the training set; the remaining rows form the test set
      dt <- sample(x = nrow(data), size = floor(nrow(data) * train.size))
      list(train = data[dt, ], test = data[-dt, ])
    }
    
    tt_list <- train_test(salespeople)
    
    sales_train <- tt_list$train
    
    sales_test <- tt_list$test

Creating decision tree model structure/final model/prediction data frame

    tree <- decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("classification")

    model <- tree %>% fit(promoted ~ ., data = sales_train)

    predictions <- predict(model,
                           sales_test,
                           type = "prob") %>%
      bind_cols(sales_test)
   

Calculate & Plot the ROC curve

When I use the .pred_yes column as the estimate column, roc_curve() calculates an ROC curve that is the inverse of what I want. It seems to have identified .pred_no as the "real" estimate column.

    roc <- roc_curve(predictions,
                     estimate = .pred_yes,
                     truth = promoted)

    autoplot(roc)

Thoughts

The issue seems to go away when I supply .pred_no as the estimate column to roc_curve().
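
A small check of that observation, sketched on made-up data rather than the salespeople set (the `toy` data frame and its values are assumptions for illustration only). It uses roc_auc() from yardstick as a numeric stand-in for the curve: if "no" is being treated as the event, the areas computed from .pred_yes and .pred_no should be mirror images of each other.

```r
library(yardstick)

# Made-up predictions; "no" sorts first alphabetically, so it is the first level
toy <- data.frame(
  promoted  = factor(c("yes", "no", "yes", "no", "no", "yes")),
  .pred_yes = c(0.9, 0.2, 0.7, 0.4, 0.1, 0.6)
)
toy$.pred_no <- 1 - toy$.pred_yes

levels(toy$promoted)
#> [1] "no"  "yes"

# By default yardstick treats the first factor level ("no") as the event
auc_yes_default <- roc_auc(toy, promoted, .pred_yes)$.estimate
auc_no_default  <- roc_auc(toy, promoted, .pred_no)$.estimate

# The two areas should add up to 1, i.e. one curve is the flip of the other
auc_yes_default + auc_no_default
```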

FYI: this is my first Stack Overflow post. If you have any suggestions to make it clearer or better formatted, please let me know!

Julia Silge On

In factor(c("yes", "no")), "no" is the first level, the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as documented here:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>% 
    mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)

tree <- decision_tree() %>%
    set_engine("rpart") %>%
    set_mode("classification") 


tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#>    promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#>    <fct>    <dbl>         <dbl>       <dbl> <fct>          <dbl>     <dbl>
#>  1 no         364          4.89           1 no             0.973    0.0267
#>  2 no         342          3.74           3 no             0.973    0.0267
#>  3 yes        716          3.16           3 yes            0        1     
#>  4 no         450          3.21           3 no             0.973    0.0267
#>  5 no         372          3.87           3 no             0.973    0.0267
#>  6 no         535          4.47           2 no             0.973    0.0267
#>  7 yes        736          3.94           4 yes            0        1     
#>  8 no         330          2.54           2 no             0.973    0.0267
#>  9 no         478          3.48           2 no             0.973    0.0267
#> 10 yes        728          2.66           3 yes            0        1     
#> # … with 78 more rows

sales_preds %>%
    roc_curve(promoted, .pred_yes, event_level = "second") %>%
    autoplot()

Created on 2021-09-08 by the reprex package (v2.0.1)
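
An alternative sketch, not part of the reprex above: instead of passing event_level = "second", the outcome factor can be built so that "yes" is its first level, which is the level yardstick treats as the event by default. The `promoted_raw` vector is made up for illustration.

```r
# Made-up 0/1 outcome standing in for the promoted column
promoted_raw <- c(1, 0, 0, 1, 0)

# Default: levels are sorted alphabetically, so "no" comes first
default_fct <- factor(ifelse(promoted_raw == 1, "yes", "no"))
levels(default_fct)
#> [1] "no"  "yes"

# Explicit levels put "yes" first, making it the default event level
yes_first <- factor(ifelse(promoted_raw == 1, "yes", "no"),
                    levels = c("yes", "no"))
levels(yes_first)
#> [1] "yes" "no"
```

If the outcome is encoded this way before the model is fit, roc_curve(promoted, .pred_yes) should treat "yes" as the event without an event_level argument.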