How to understand the trimmed mean in SciPy


I can't explain the behaviour of trim_mean() in scipy.stats.

As I understand it, a trimmed mean is the average of a series of numbers after discarding given parts of the probability distribution at both tails.

In the following example, the result is 6.1111:

from scipy.stats import trim_mean

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
trim_percentage = 0.05  # Trim 5% from each end

result = trim_mean(sorted(data), trim_percentage)
print(f"result = {result}")

result = 6.111111111111111

However, I expected 1 and 30 to be cut, because they fall below the 5th percentile and above the 95th percentile, respectively.

When I do it manually:

import numpy as np

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p5, p95 = np.percentile(data, [5, 95])
print(f"The 5th percentile = {p5}\nThe 95th percentile = {p95}")

# Keep only the values strictly between the two percentiles
trim_average = np.mean([x for x in data if p5 < x < p95])
print(f"trimmed average = {trim_average}")

I got this:

The 5th percentile = 1.4
The 95th percentile = 19.999999999999993
trimmed average = 3.4285714285714284

Does this mean that trim_mean() treats each number separately and assumes a uniform distribution? The proportiontocut parameter is explained as "Fraction to cut off of both tails of the distribution". So why does it behave as if the distribution were not considered?

Matt Haberland On BEST ANSWER

The phrasing in the documentation should be more precise: trim_mean cuts a fraction of the observations in your sample. You have 9 values, and 5% of 9 values is 0.45 values. However, the function can't cut off a fraction of a value. The documentation states that it

Slices off less if proportion results in a non-integer slice index

So in your case, the 0.45 is rounded down and zero values are cut from each end before taking the mean.

import numpy as np
from scipy import stats
x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
np.mean(x)  # 6.111111111111111
stats.trim_mean(x, 0.05)  # 6.111111111111111
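To make the rounding step concrete, here is a minimal pure-Python sketch of count-based trimming. The helper names count_to_cut and trimmed_mean_by_count are hypothetical, and this is not SciPy's source code; it only assumes, as the documentation quote above suggests, that the fractional count is rounded down to a whole number of observations per end.

```python
from statistics import fmean


def count_to_cut(n, proportiontocut):
    """Number of observations dropped from EACH end (fraction rounded down)."""
    return int(proportiontocut * n)


def trimmed_mean_by_count(values, proportiontocut):
    """Hypothetical helper: mean after dropping whole observations from both ends."""
    ordered = sorted(values)
    k = count_to_cut(len(ordered), proportiontocut)
    if 2 * k >= len(ordered):
        # Design choice of this sketch: refuse to trim everything away.
        raise ValueError("proportiontocut is too large for this sample")
    return fmean(ordered[k:len(ordered) - k])


x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
print(count_to_cut(len(x), 0.05))      # 0.45 rounds down, so nothing is cut
print(trimmed_mean_by_count(x, 0.05))  # same as the plain mean
print(trimmed_mean_by_count(x, 0.12))  # 1.08 rounds down to 1: drops 1 and 30
```

With 9 observations, any proportion below roughly 1/9 leaves the sample untouched, which is why 0.05 gives back the ordinary mean.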

You can verify that the result changes once proportiontocut crosses 1/len(x), the point at which one whole observation is cut from each end:

from scipy import stats
x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p = 1 / len(x)
eps = 1e-15
stats.trim_mean(x, p-eps)  # 6.111111111111111
stats.trim_mean(x, p+eps)  # 3.4285714285714284

This behavior appears to be consistent with the description of a trimmed mean on Wikipedia, at least:

This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points... For example, given a set of 8 points, trimming by 12.5% would discard the minimum and maximum value in the sample: the smallest and largest values, and would compute the mean of the remaining 6 points.
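The 8-point case from that quote can be checked directly. The sample below is made up for illustration; with 8 values and a proportion of 0.125, exactly one observation should be dropped from each end, so the result should match the mean of the middle six.

```python
from scipy import stats

points = [10, 2, 4, 6, 8, 12, 14, 100]  # 8 made-up observations
plain = sum(points) / len(points)

# 0.125 * 8 = 1, so one point is cut from each end
trimmed = stats.trim_mean(points, 0.125)

# Manual check: sort, drop the smallest and largest, average the rest
middle_six = sorted(points)[1:-1]
manual = sum(middle_six) / len(middle_six)

print(plain, trimmed, manual)
```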

SciPy does not have a function that trims based on percentiles (of which there are many conventions). For that, you'd need to write your own function, or perhaps there is such a function in another library.
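If percentile-based trimming is what you want, a small function along these lines might do. This is only a sketch: percentile_trimmed_mean is a hypothetical name, it relies on the default interpolation of np.percentile (one convention among several), and it keeps only values strictly between the two cut points, mirroring the manual calculation in the question.

```python
import numpy as np


def percentile_trimmed_mean(values, lower_pct=5, upper_pct=95):
    """Hypothetical helper: mean of values strictly between two percentiles."""
    arr = np.asarray(values, dtype=float)
    # np.percentile's default (linear interpolation) is only one convention.
    low, high = np.percentile(arr, [lower_pct, upper_pct])
    kept = arr[(arr > low) & (arr < high)]
    if kept.size == 0:
        raise ValueError("no observations fall strictly between the cut points")
    return kept.mean()


data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
print(percentile_trimmed_mean(data))  # drops 1 and 30 for this sample
```

Whether the comparisons should be strict or inclusive is a judgment call; with ties at the cut points the two choices can give different answers.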

Please consider opening an issue about improving the documentation.