I want to winsorize my data, which looks like the following (134 observations in total):

                         company   id    rev   size age 
1                           Adeg 29.9   0.66    160  45     
2                         Agrana 32.0   2.80   9191  29     
3                        Allianz 36.5  87.75 142460 128     
4                        Andritz 34.0   6.89  29096 118     
5                          Apple 41.0 259.65 132000  41

To use the Winsorize function from the DescTools package, I created what I thought was a single numeric vector of the variable rev, simply by using the select function: rev_vector <- select(data1, -...)

I then ran the function as follows, which gives me an error:

> Winsorize(rev_vector)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected

Is this because I passed a data.frame instead of a vector? Alternatively, I tried the following:

> Winsorize(rev_vector$rev, probs = c(0.05, 0.95))
  [1]   0.66   2.80  87.75   6.89 134.73   0.09  22.78   1.36   5.48   0.70   0.79   0.35  31.37   0.55   0.94   0.06
 [17]  12.36  13.58   7.95   0.29   7.80   0.39  73.55   0.09  23.07   0.27   0.32   0.08   0.05   0.41  29.47   0.66
 [33]  20.91   0.67   0.05   1.39   0.17   0.14   1.79   0.05   2.52   3.68   0.24   0.09 109.65   8.43   0.20   0.17
 [49]  35.93   3.05   0.07   0.05   0.82   0.57  26.21   0.28   0.05   5.72   6.12   4.09   0.05   0.22 134.73  94.43
 [65]  41.35   0.20  17.32   5.63   3.25   0.12   0.05   0.07  10.89   3.79   1.89 134.73   9.98  10.58  54.98 134.73
 [81]  15.55  15.21   5.93  42.65   1.59   3.00  11.19   6.10   0.08 134.73  31.37  17.74  20.92   6.46   3.18   0.05
 [97]   0.81   9.15  29.47   0.05   1.34   7.97 109.65  28.45  35.93   0.38   0.65 134.73   9.44   8.66   5.30  11.83
[113]  20.06  29.55   1.15   2.32  46.14 134.73   9.98  10.58  11.05  54.98 134.73  15.55  15.21   5.93   1.59   1.03
[129]   3.00  11.19   6.10

I am not sure what this output means. I don't think the winsorizing actually worked: the summary of the vector, summary(rev_vector$rev), is unchanged from the one before winsorizing.

Can somebody help me out here? Thanks!

There are 1 best solutions below


You are almost there; the probs you chose for the quantiles are simply too close to the extremes for this data. Your vector already has a considerable number of equal values at its edges. Has it perhaps been winsorized before?

library(DescTools)

x <-  c(0.66, 2.8, 87.75, 6.89, 134.73, 0.09, 22.78, 1.36, 
        5.48, 0.7, 0.79, 0.35, 31.37, 0.55, 0.94, 0.06, 12.36, 13.58, 
        7.95, 0.29, 7.8, 0.39, 73.55, 0.09, 23.07, 0.27, 0.32, 0.08, 
        0.05, 0.41, 29.47, 0.66, 20.91, 0.67, 0.05, 1.39, 0.17, 0.14, 
        1.79, 0.05, 2.52, 3.68, 0.24, 0.09, 109.65, 8.43, 0.2, 0.17, 
        35.93, 3.05, 0.07, 0.05, 0.82, 0.57, 26.21, 0.28, 0.05, 5.72, 
        6.12, 4.09, 0.05, 0.22, 134.73, 94.43, 41.35, 0.2, 17.32, 5.63, 
        3.25, 0.12, 0.05, 0.07, 10.89, 3.79, 1.89, 134.73, 9.98, 10.58, 
        54.98, 134.73, 15.55, 15.21, 5.93, 42.65, 1.59, 3, 11.19, 6.1, 
        0.08, 134.73, 31.37, 17.74, 20.92, 6.46, 3.18, 0.05, 0.81, 9.15, 
        29.47, 0.05, 1.34, 7.97, 109.65, 28.45, 35.93, 0.38, 0.65, 134.73, 
        9.44, 8.66, 5.3, 11.83, 20.06, 29.55, 1.15, 2.32, 46.14, 134.73, 
        9.98, 10.58, 11.05, 54.98, 134.73, 15.55, 15.21, 5.93, 1.59, 
        1.03, 3, 11.19, 6.1)

summary() is somewhat coarse in this case.

summary(Winsorize(x))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.05    0.48    5.48   19.73   17.53  134.73 

Using Desc() gives you a more detailed idea of what's going on in your data.

Desc(Winsorize(x))

# -----------------------------------------------------    
# Winsorize(x) (numeric)
#
#  length       n    NAs  unique     0s   mean  meanCI
#     131     131      0      95      0  19.73   13.53
#          100.0%   0.0%           0.0%          25.92
#                                                     
#     .05     .10    .25  median    .75    .90     .95
#    0.05    0.08   0.48    5.48  17.53  54.98  134.73
#                                                     
#   range      sd  vcoef     mad    IQR   skew    kurt
#  134.68   35.84   1.82    7.87  17.05   2.35    4.42
#                                                     
# lowest : 0.05 (9), 0.06, 0.07 (2), 0.08 (2), 0.09 (3)
# highest: 73.55, 87.75, 94.43, 109.65 (2), 134.73 (8)

You can see that the value 0.05 occurs 9 times and the value 134.73 occurs 8 times. With 131 values, that is about 6.9% and 6.1% of the data, so the quantiles for probs 0.05 and 0.95 coincide with the minimum and the maximum, and the winsorized vector is identical to the original one.

quantile(x=x, probs=c(0.05, 0.95))
#    5%    95% 
#  0.05 134.73 
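
As a rough way to run the same check on your own data, here is a minimal base R sketch. The vector x_demo is a made-up example (not your data) with three tied values at each end, and the variable names are only illustrative.

```r
# Hypothetical example vector: three tied values at each end (not the poster's data)
x_demo <- c(rep(1, 3), 2:17, rep(20, 3))

# Share of observations sitting exactly on the minimum / maximum
share_min <- mean(x_demo == min(x_demo))
share_max <- mean(x_demo == max(x_demo))

# Cutoffs that a 5% / 95% winsorization would use
q <- quantile(x_demo, probs = c(0.05, 0.95), names = FALSE)

# TRUE when both cutoffs fall on the extremes, i.e. winsorizing changes nothing
cutoffs_hit_extremes <- isTRUE(all.equal(q[1], min(x_demo))) &&
  isTRUE(all.equal(q[2], max(x_demo)))

print(c(share_min = share_min, share_max = share_max))
print(cutoffs_hit_extremes)
```

If the ties at an extreme make up clearly more than the tail share you cut off, the cutoff will typically land on that extreme, and there is nothing left to pull in.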

Simply move the probs further inward, say to c(0.1, 0.9), and you'll see the effect.
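
To illustrate the difference without depending on a particular DescTools version, here is a small base R sketch of quantile-based winsorizing. The helper winsorize_base() and the vector x_demo are hypothetical names made up for this illustration; this is not how DescTools implements Winsorize(), just the same idea written with quantile(), pmin() and pmax().

```r
# Hypothetical helper (base R only): clamp values to the quantiles given by probs
winsorize_base <- function(x, probs = c(0.05, 0.95)) {
  stopifnot(is.numeric(x), length(probs) == 2)
  cut <- quantile(x, probs = probs, names = FALSE)
  pmax(pmin(x, cut[2]), cut[1])
}

# Made-up example vector with three tied values at each end
x_demo <- c(rep(1, 3), 2:17, rep(20, 3))

w05 <- winsorize_base(x_demo, probs = c(0.05, 0.95))
w10 <- winsorize_base(x_demo, probs = c(0.10, 0.90))

# With 5% / 95% the cutoffs equal min and max, so nothing changes
print(isTRUE(all.equal(w05, x_demo)))

# With 10% / 90% the tails are pulled in
print(range(x_demo))
print(range(w10))
```

In this example the range stays at 1 to 20 for the 5% / 95% cutoffs and shrinks once the cutoffs move inward to 10% / 90%.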

PS: Winsorize() needs a vector as its argument and can't handle data.frames, which is why your first call failed. (This is also described in the help file…)
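
A short base R sketch of the difference, using the first five rows from your question as a stand-in for data1 (the object names rev_df and rev_vec are only illustrative):

```r
# Stand-in for data1, built from the five rows shown in the question
data1 <- data.frame(
  company = c("Adeg", "Agrana", "Allianz", "Andritz", "Apple"),
  rev     = c(0.66, 2.80, 87.75, 6.89, 259.65)
)

# Single-bracket selection keeps the data.frame structure,
# just like dplyr::select() does
rev_df <- data1["rev"]

# Double brackets (or data1$rev, or dplyr::pull()) return the plain numeric vector
rev_vec <- data1[["rev"]]

print(is.data.frame(rev_df))   # TRUE: not a valid input for Winsorize()
print(is.numeric(rev_vec))     # TRUE: this is what Winsorize() expects
```

Passing rev_vec (or data1$rev directly) avoids the "undefined columns selected" error.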

PPS: a reproducible example would help… ;-)