In a boxplot I've set the option outline=FALSE to remove the outliers.
Now I'd like to include points that show the mean in the boxplot. Obviously, the means calculated using mean include the outliers.

How can the very same outliers be removed from a dataframe so that the calculated mean corresponds to the data shown in the boxplot?

I know how outliers can be removed in general, but which criterion does boxplot use internally to decide which points outline=FALSE hides? Unfortunately, the manual does not clarify this.


0
On BEST ANSWER

To remove the outliers, you must set the option outline to FALSE.

Let's assume that your data are the following:

data <- data.frame(a = c(seq(0,1,0.1),3))

Then, you use the boxplot function:

res <- boxplot(data, outline=FALSE)

In the res object, you have several pieces of information about your data. Among these, res$out gives you all the outliers. Here there is only the value 3.

Thus, to compute the mean without the outliers, you can simply do:

mean(data$a[!data$a %in% res$out])
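
If the data frame has several columns, the same idea can be applied per column before drawing the means on top of the boxes. The following is only a sketch: the data frame `dat` and its column `b` are made up for illustration, and it relies on `res$group` reporting which box each entry of `res$out` belongs to.

```r
# Hypothetical data: two columns, each with one extreme value
dat <- data.frame(a = c(seq(0, 1, 0.1), 3),
                  b = c(seq(2, 3, 0.1), 10))

res <- boxplot(dat, outline = FALSE)

# res$out holds the outliers of all boxes, res$group the box index of each one
means <- sapply(seq_along(dat), function(i) {
  col_out <- res$out[res$group == i]
  mean(dat[[i]][!dat[[i]] %in% col_out])
})

# Draw the outlier-free means on top of the existing boxes
points(seq_along(means), means, pch = 19, col = "red")
```

Filtering per box matters here because the value 3 is an outlier in column `a` but an ordinary value in column `b`.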
1
On

If you look at the Value section of ?boxplot, you find:

"List with the following components: [...]

out: the values of any data points which lie beyond the extremes of the whiskers."

Thus, you can assign the result of your boxplot call to an object, extract the outliers, and remove them from the original values:

x <- c(-10, 1:5, 50)
x
# [1] -10   1   2   3   4   5  50

bx <- boxplot(x)
str(bx)
# List of 6
# $ stats: num [1:5, 1] 1 1.5 3 4.5 5
# $ n    : num 7
# $ conf : num [1:2, 1] 1.21 4.79
# $ out  : num [1:2] -10 50
# $ group: num [1:2] 1 1
# $ names: chr "1"

x2 <- x[!(x %in% bx$out)]
x2
# [1] 1 2 3 4 5
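
As a possible variation, the same result can be obtained without drawing a plot by calling `boxplot.stats()`, which applies the same rule. The sketch below filters by the whisker ends instead of matching values with `%in%`; the name `x_kept` is just illustrative, and the approach assumes `x` contains no `NA` values.

```r
x <- c(-10, 1:5, 50)

# boxplot.stats() computes the boxplot statistics without plotting
st <- boxplot.stats(x)

# Elements 1 and 5 of st$stats are the whisker ends, i.e. the most extreme
# values that are not flagged as outliers, so they define the range to keep
x_kept <- x[x >= st$stats[1] & x <= st$stats[5]]
mean(x_kept)
```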
1
On

To answer the second part of your question, about how the outliers are chosen, it helps to recall how the boxplot is constructed:

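  • the "body" of the boxplot spans from the first to the third quartile, i.e. it covers the second and third quarters of the data; its length is the interquartile range (IQR)
  • each whisker extends to the most extreme data point that lies within 1.5*IQR of the end of that body (1.5 is the default of boxplot's range argument); points beyond that are treated as outliers.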
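
The rule above can be reproduced by hand, as in the following sketch. It uses `fivenum()` because R's boxplot works with hinges, which may differ slightly from the quartiles returned by `quantile()`. The variable names are only illustrative.

```r
x <- c(-10, 1:5, 50)

# fivenum() returns min, lower hinge, median, upper hinge, max
h <- fivenum(x)
box_len <- h[4] - h[2]   # length of the box (hinge spread)
coef <- 1.5              # default of the `range` argument of boxplot()

lower_fence <- h[2] - coef * box_len
upper_fence <- h[4] + coef * box_len

# Points outside the fences are the ones boxplot() reports in $out
manual_out <- x[x < lower_fence | x > upper_fence]
manual_out
```

Passing a different `range` to `boxplot()` (or `coef` to `boxplot.stats()`) changes the fences accordingly.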

If you assume that your data are normally distributed, the fraction of data beyond each whisker is:

1-pnorm(qnorm(0.75)+1.5*2*qnorm(0.75))

which is about 0.0035. Counting both tails, a normally distributed variable therefore has about 0.7% "boxplot outliers".
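
A quick simulation can serve as a sanity check of this figure. This is only a sketch: the seed and sample size are arbitrary choices, and the observed fraction will fluctuate a little around the theoretical value.

```r
set.seed(1)              # arbitrary seed, only for reproducibility
z <- rnorm(1e5)          # large standard normal sample

# Fraction of points flagged by the default 1.5*IQR rule
frac_out <- length(boxplot.stats(z)$out) / length(z)

# Theoretical two-sided fraction from the formula above
theory <- 2 * (1 - pnorm(qnorm(0.75) + 1.5 * 2 * qnorm(0.75)))

c(simulated = frac_out, theoretical = theory)
```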

That said, the 1.5*IQR rule is only a rough heuristic and not a very "reliable" way to detect outliers; there are packages specifically designed for this.