When I run the code below, roc_curve() produces an incorrect (inverted) ROC curve.
Prep
The code below should be runnable for anyone using RStudio. The data frame contains characteristics of different employees: performance ratings, sales figures, and whether or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict whether an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
Creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
  # Drop incomplete rows first so the sampled indices match the data being split
  if (na.rm) {
    data <- na.omit(data)
  }
  # Sample row indices for the training set; the remaining rows form the test set
  dt <- sample(x = nrow(data), size = floor(nrow(data) * train.size))
  list(train = data[dt, ], test = data[-dt, ])
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
'''
Creating the decision tree model structure, final model, and prediction data frame
'''
tree <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model, sales_test, type = "prob") %>%
  bind_cols(sales_test)
'''
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, roc_curve() calculates an ROC curve that is the inverse of what I want. It seems to have identified .pred_no as the "real" estimate column.
'''
roc <- roc_curve(predictions,
                 estimate = .pred_yes,
                 truth = promoted)
autoplot(roc)
'''
Thoughts
The issue seems to go away when I supply .pred_no as the estimate column to roc_curve().
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In factor(c("yes", "no")), "no" is the first level, which is the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as documented in the yardstick reference for roc_curve().
Created on 2021-09-08 by the reprex package (v2.0.1)
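As a minimal sketch of how event_level can be applied: the block below uses a small made-up data frame (toy, with hypothetical values in promoted and .pred_yes) rather than the salespeople predictions, so the numbers are illustrative only. It assumes a yardstick version that supports event_level. The second option, reordering the factor levels, is an alternative not mentioned in the answer above.

```r
library(yardstick)

# Hypothetical toy data standing in for `predictions`; not the salespeople data.
set.seed(42)
toy <- data.frame(
  promoted  = factor(rep(c("no", "yes"), each = 50)),
  .pred_yes = c(runif(50, 0, 0.6), runif(50, 0.4, 1))
)

# Levels are sorted alphabetically, so "no" comes first and is treated as the event.
levels(toy$promoted)

# Option 1: tell yardstick that the second level ("yes") is the event.
roc_yes <- roc_curve(toy, truth = promoted, .pred_yes, event_level = "second")
auc_yes <- roc_auc(toy, truth = promoted, .pred_yes, event_level = "second")

# Default behaviour for comparison: "no" is the event, so the AUC is mirrored.
auc_default <- roc_auc(toy, truth = promoted, .pred_yes)

# Option 2 (an alternative, not from the answer above): make "yes" the first level.
toy_relevel <- toy
toy_relevel$promoted <- factor(toy_relevel$promoted, levels = c("yes", "no"))
auc_relevel <- roc_auc(toy_relevel, truth = promoted, .pred_yes)
```

With the question's objects, the equivalent call would presumably be roc_curve(predictions, truth = promoted, .pred_yes, event_level = "second"), followed by autoplot() as before.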