When I run the below data it shows an incorrect roc_curve.
Prep
The below code should be run-able for anyone using r-studio. The dataframe contains characteristics of different employees regarding: performance ratings, sales figures, and whether or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict if an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when tring to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size=0.7, na.rm=FALSE) {
if(na.rm == TRUE) {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
data_nm <- na.omit(data)
train<-data_nm[dt,]
test<- data_nm[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
} else {
dt <- sample(x=nrow(data), size=nrow(data)* train.size)
train<-data[dt,]
test<- data[-dt,]
set <- list(train, test)
names(set) <- c("train", "test")
return(set)
}
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
'''
creating decision tree model structure/final model/prediction dataframe
'''
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
'''
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column
'''
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
'''
Thoughts
Seems like the issue goes away when I supply pred_no as the estimate column to roc_curve()
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In
factor(c("yes", "no")), "no" is the first level, the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via theevent_levelargument, as documented here:Created on 2021-09-08 by the reprex package (v2.0.1)