I can't explain the behaviour of trim_mean() in scipy.stats.
I learned that the trimmed mean is the average of a series of numbers after discarding given portions of the tails of the distribution.
In the following example, I got the result 6.1111:
from scipy.stats import trim_mean
data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
trim_percentage = 0.05 # Trim 5% from each end
result = trim_mean(sorted(data), trim_percentage)
print(f"result = {result}")
result = 6.111111111111111
However, I expected that 1 and 30 would be cut out, because they fall below the 5th percentile and above the 95th percentile.
When I do it manually:
import numpy as np
data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p5, p95 = np.percentile(data, [5, 95])
print(f"The 5th percentile = {p5}\nThe 95th percentile = {p95}")
trim_average = np.mean(list(filter(lambda x: p5 < x < p95, data)))
print(f"trimmed average = {trim_average}")
I got this:
The 5th percentile = 1.4
The 95th percentile = 19.999999999999993
trimmed average = 3.4285714285714284
Does this mean that trim_mean() treats each number separately and assumes a uniform distribution? proportiontocut is documented as the "Fraction to cut off of both tails of the distribution". So why does it behave as if the distribution were not considered?
The phrasing in the documentation should be more precise: it cuts a fraction of the observations in your sample, not of an underlying distribution. You have 9 values, and 5% of 9 values is 0.45 values. Since it can't cut off a fraction of a value, the count of values to trim is rounded down.
So in your case, zero values are cut from each end before taking the mean, and you simply get the mean of all 9 values.
You can verify that the result changes once proportiontocut exceeds 1/len(data). This behavior is consistent with the description of a trimmed mean on Wikipedia, at least.
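A quick check with the same 9-element list: int(0.05 * 9) == 0, so nothing is trimmed, while int(0.12 * 9) == 1, so one value is dropped from each end.

```python
from scipy.stats import trim_mean
import numpy as np

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]

# 5% of 9 values is 0.45, which truncates to 0 values per tail,
# so nothing is trimmed and we get the plain mean of all 9 values:
print(trim_mean(data, 0.05))  # 6.111111111111111, same as np.mean(data)

# 12% of 9 values is 1.08, which truncates to 1 value per tail,
# so the minimum (1) and the maximum (30) are dropped:
print(trim_mean(data, 0.12))  # 3.4285714285714284
```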
SciPy does not have a function that trims based on percentiles (of which there are many conventions). For that, you'd need to write your own function, or perhaps there is such a function in another library.
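A minimal sketch of such a function (percentile_trimmed_mean is a made-up name, and keeping only values strictly between the two percentiles is just one possible convention; it matches the manual calculation above):

```python
import numpy as np

def percentile_trimmed_mean(a, lower=5, upper=95):
    """Mean of the values strictly between the given percentiles.

    Note that np.percentile itself supports several estimation
    conventions (its `method` parameter), so the result depends
    on that choice as well.
    """
    lo, hi = np.percentile(a, [lower, upper])
    return np.mean([x for x in a if lo < x < hi])

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
print(percentile_trimmed_mean(data))  # 3.4285714285714284 (1 and 30 dropped)
```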
Please consider opening an issue about improving the documentation.