I am trying to implement a winsorization function but am confused by its exact definition. Apparently, the Winsorize function in the R package DescTools and the winsorize function in the Python library scipy.stats.mstats yield different results. I am a little surprised by this, as both functions are very popular but nobody seems to care about the difference. Here is a simple test:
In R
library(DescTools)
data <- seq(0, 99)
Winsorize(data, probs=c(0.025, 1-0.025))
The result is [2.475, 2.475, 2.475, 3., 4., 5., 6., ..., 96., 96.525, 96.525, 96.525].
However, in Python,
import numpy as np
from scipy.stats.mstats import winsorize
data = np.arange(100).astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
new_data = winsorize(data, [0.025, 0.025])
new_data
The result is [2., 2., 2., 3., 4., 5., 6., ..., 96., 97., 97., 97.].
What makes it even worse is that, based on Wikipedia's example, it should be [3., 3., 3., 3., 4., 5., 6., ..., 96., 96., 96., 96.], because the 2.5th percentile is 2.475, which falls between 2 and 3, and therefore everything less than 2.475 should be rounded to 3.
Does anybody know which version I should implement?
Thanks
It seems to be a difference in how the quantile is defined. R uses a continuous quantile function by default, which is described in ?quantile's list of 9 types of quantiles under "Type 7". If you use type = 1 in DescTools::Winsorize, the results seem to match winsorize from scipy.stats.mstats (just based on the output shown in the question).
None of the 9 methods produces the output shown on the Wikipedia page for that example. There's no citation there, though, so I wouldn't put too much weight on it.
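To make the difference concrete, here is a minimal pure-Python sketch of the two quantile definitions involved. It is not the code of either library. The helper names (quantile_type7, quantile_type1, winsorize_by_quantile) are made up for illustration, and the formulas are my reading of the "Type 7" and "Type 1" descriptions in ?quantile. Winsorizing is done here simply by clipping to the two quantiles.

```python
import math

def quantile_type7(sorted_x, p):
    """Linear interpolation between order statistics (R's default, "Type 7")."""
    h = (len(sorted_x) - 1) * p          # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_x) - 1)  # guard against running off the end at p = 1
    return sorted_x[lo] + (h - lo) * (sorted_x[hi] - sorted_x[lo])

def quantile_type1(sorted_x, p):
    """Inverse of the empirical CDF ("Type 1"): always returns an observed value."""
    k = max(math.ceil(len(sorted_x) * p), 1)  # 1-based rank of the order statistic
    return sorted_x[k - 1]

def winsorize_by_quantile(x, p_low, p_high, quantile):
    """Hypothetical helper: clip every value to the [p_low, p_high] quantiles."""
    s = sorted(x)
    lo, hi = quantile(s, p_low), quantile(s, p_high)
    return [min(max(v, lo), hi) for v in x]

data = [float(i) for i in range(100)]
type7 = winsorize_by_quantile(data, 0.025, 0.975, quantile_type7)
type1 = winsorize_by_quantile(data, 0.025, 0.975, quantile_type1)

print(type7[:4], type7[-4:])  # interpolated limits, as in the R output
print(type1[:4], type1[-4:])  # [2.0, 2.0, 2.0, 3.0] [96.0, 97.0, 97.0, 97.0]
```

On this data the Type 7 variant clips to about 2.475 and 96.525 (up to floating-point rounding), like the R output in the question, and the Type 1 variant clips to 2 and 97, like the scipy output. So which one to implement comes down to whether you want the limits interpolated or restricted to observed values.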