We have been given a large dataset, and we are being asked to remove outliers using the IQR method using R.
The data has 53 columns, 17 of which are continuous, the remaining are categorical. How would you use the IQR method to remove outliers to multiple columns, and update the data frame?
I found this function online, but it does not account for columns that have categorical values.
sample_data <- data.frame(x=c(1, 2, 3, 4, 3, 2, 3, 4, 4, 5, 0),
y=c(4, 3, 5, 7, 8, 5, 9, 7, 6, 5, 0),
z=c(1, 3, 2, 9, 8, 7, 0, 8, 7, 2, 3))
print("Display original dataframe")
print(sample_data)
detect_outlier <- function(x) {
# calculate first quantile
Quantile1 <- quantile(x, probs=.25)
# calculate third quantile
Quantile3 <- quantile(x, probs=.75)
# calculate inter quartile range
IQR = Quantile3-Quantile1
# return true or false
x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}
# create remove outlier function
remove_outlier <- function(dataframe,
columns=names(dataframe)) {
# for loop to traverse in columns vector
for (col in columns) {
# remove observation if it satisfies outlier function
dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
}
# return dataframe
print("Remove outliers")
print(dataframe)
}
remove_outlier(sample_data, c('x', 'y', 'z', 'w'))
########these are all my variables
'car.deliver.airport.num','car.deliver.hotel.num','car.deliver.train.station.num','car.displayed.turo.review.num', 'car.displayed.turo.review.num.past.12m','car.displayed.turo.review.num.past.18m','car.displayed.turo.review.num.past.6m','car.displayed.user.review.num','car.displayed.user.review.num.past.12m','car.displayed.user.review.num.past.18m','car.displayed.user.review.num.past.6m','car.extra.mile.fee','car.extra.num','car.extra.phone.mount','car.extra.portable.gps','car.extra.post.trip.cleaning','car.extra.prepaid.ev.recharge','car.extra.prepaid.refuel','car.extra.stroller','car.extra.unlimited.mileage','car.faq.num','car.instant.book','car.insurance','car.miles.included','car.photo.num','car.trip.price','host.car.num'
Original output
#Output after
# x y z
#1 1 4 1
#2 2 3 3
#3 3 5 2
#4 4 7 9
#5 3 8 8
#6 2 5 7
#7 3 9 0
#8 4 7 8
#9 4 6 7
#10 5 5 2
#11 0 0 3
#Output after
# x y z
#1 1 4 1
#2 2 3 3
#3 3 5 2
#4 4 7 9
#5 3 8 8
#6 2 5 7
#7 3 9 0
#8 4 7 8
#9 4 6 7
#10 5 5 2
I'm expecting for outliers to be removed from the original dataframe, for only continuous variables.
First I have modified your function, replace outliers with NA and allow modifying the IQR times
Select which columns to process (numeric data)
Finally apply the function to the data frame.