I'm currently trying to exclude outliers based on a subset of selected variables, with the aim of performing sensitivity analyses. I've adapted the function available here: calculating the outliers in R, but have been unsuccessful so far (I'm still a novice R user). Please let me know if you have any suggestions!
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))
vars_of_interest <- c("measure1", "measure3", "measure4")
# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}
# use the function to identify outliers
temp <- FindOutliers(df[vars_of_interest])
# remove the outliers
testData <- testData[-temp]
# show the data with the outliers removed
testData
Separate the concerns:
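One way to split the work is into two small functions: one that flags outliers in a single numeric vector, and one that drops the flagged rows from a data.frame. The following is a minimal base R sketch, not a definitive implementation. The name `is_outlier` matches the call used in this answer; `remove_outliers` and the `k` argument are hypothetical names chosen for illustration, with `k = 3` mirroring the "extreme" threshold in the question. It assumes the columns contain no missing values.

```r
# Flag values lying more than k * IQR below the first quartile or above the
# third quartile. Returns a logical vector of the same length as x.
# Assumes x has no NA values (quantile() stops with an error otherwise).
is_outlier <- function(x, k = 3) {
  qs <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- qs[2] - qs[1]
  x < qs[1] - k * iqr | x > qs[2] + k * iqr
}

# Drop every row that is an outlier in at least one of the columns in cols.
# (remove_outliers is a hypothetical helper name.)
remove_outliers <- function(df, cols, ...) {
  flags <- lapply(df[cols], is_outlier, ...)  # one logical vector per column
  any_outlier <- Reduce(`|`, flags)           # TRUE if flagged in any column
  df[!any_outlier, , drop = FALSE]
}
```

Note that the question's `FindOutliers` is called on several columns at once, whereas `quantile()` expects a single numeric vector, which is why the per-vector check is kept separate here.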
I would suggest returning a logical (boolean) vector rather than indices. The returned value then has the same length as the input, which makes it easy to create a new column, for example `df$outlier <- is_outlier(df$measure1)`.
Note how the argument names make it clear which type of input is expected: `x` is a standard name for a numeric vector, `df` is obviously a data.frame, and `cols` is probably a list or vector of column names.
I made a point to only use base R, but in real life I would use the `dplyr` package to manipulate the data.frame.
Armed with these 2 functions, it becomes very easy:
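With the question's data, the call could look like the sketch below. The two helpers are repeated in condensed form so the example runs on its own; `remove_outliers`, the `k = 3` default, the `set.seed()` call, and the injected extreme value are all assumptions added for illustration and are not part of the original post.

```r
# Condensed helpers (hypothetical names; k = 3 mirrors the question's
# "extreme outlier" threshold). Assumes no NA values in the columns.
is_outlier <- function(x, k = 3) {
  qs <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  x < qs[1] - k * diff(qs) | x > qs[2] + k * diff(qs)
}
remove_outliers <- function(df, cols) {
  df[!Reduce(`|`, lapply(df[cols], is_outlier)), , drop = FALSE]
}

set.seed(42)  # added so the simulated data are reproducible
df <- data.frame(ID = 1001:1011,
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))
df$measure1[3] <- 500  # inject one obvious outlier for demonstration

vars_of_interest <- c("measure1", "measure3", "measure4")

# Keep only rows with no outlier in any of the selected variables
df_clean <- remove_outliers(df, vars_of_interest)
df_clean
```

Because outliers are only checked in `vars_of_interest`, extreme values in `measure2` do not cause a row to be dropped, which is what a sensitivity analysis on a subset of variables needs.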
Created on 2020-03-23 by the reprex package (v0.3.0)