I have been struggling with the {forcats} fct_lump_prop() function, more specifically with its w = and prop = arguments. What exactly are they supposed to represent in the example below?
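library(dplyr)    # provides %>% and mutate(), and re-exports tibble()
library(forcats)  # provides fct_lump_prop()
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
             var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))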
# A tibble: 6 x 2
  var1   var2
  <chr> <dbl>
1 a       0.2
2 a       0.3
3 a       0.2
4 b       0.1
5 b       0.1
6 b       0.1
What prop value should I set so that a is kept intact but b is lumped as Others?
# A tibble: 6 x 3
  var1   var2 fct
  <chr> <dbl> <fct>
1 a       0.2 a
2 a       0.3 a
3 a       0.2 a
4 b       0.1 Others
5 b       0.1 Others
6 b       0.1 Others
I have tried empirically but can't work out how this behaves. Intuitively, I would say that a prop value above 0.3 (the sum of the b weights) and below 0.7 (the sum of the a weights) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, at which point everything is lumped into a single Other level:
df %>%
  mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))
# A tibble: 6 x 3
  var1   var2 fct
  <fct> <dbl> <fct>
1 a       0.2 Other
2 a       0.3 Other
3 a       0.2 Other
4 b       0.1 Other
5 b       0.1 Other
6 b       0.1 Other
So what am I not understanding? The few examples I found elsewhere did not help me grasp how prop and w behave.
Thanks a lot.
The relevant explanation is in the source of fct_lump_prop(). That is, if only one level would end up in "Other", the function does nothing and returns the factor unchanged.
Created on 2022-11-14 with reprex v2.0.2
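To see what prop is compared against for the data in the question, the weighted share of each level can be worked out by hand. This is a minimal base-R sketch, assuming that a level's share is the sum of its w values divided by the total of w; the names `level_w` and `share` are only illustrative and are not taken from forcats.

```r
# Data from the question
var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

# Sum the weights within each level, then divide by the total weight
level_w <- tapply(var2, var1, sum)
share <- level_w / sum(var2)

# Rounded for display, to avoid floating-point noise
round(share, 2)
```

Under that assumption the a weights add up to about 0.7 of the total and the b weights to about 0.3, which fits the threshold at prop = 0.7 seen in the question.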
This behaviour doesn't appear to be documented except in the source code itself.
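As an illustration of the behaviour described above, here is a rough base-R re-creation. It is not the forcats implementation: `lump_prop_sketch()` is a hypothetical helper written for this answer, and it assumes that levels whose weighted share is at or below prop are the candidates for lumping and that nothing happens when there is at most one candidate.

```r
# Hypothetical re-creation of the described behaviour; NOT forcats code
lump_prop_sketch <- function(f, w, prop, other_level = "Other") {
  f <- factor(f)
  # Weighted share of each level, in the same order as levels(f)
  share <- as.vector(tapply(w, f, sum)) / sum(w)
  to_lump <- share <= prop

  # The guard: with zero or one candidate level, return the input unchanged
  if (sum(to_lump) <= 1) {
    return(f)
  }

  # Rename the candidate levels, keep the others, and put other_level last
  new_levels <- ifelse(to_lump, other_level, levels(f))
  factor(new_levels[as.integer(f)],
         levels = unique(c(new_levels[!to_lump], other_level)))
}

var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

# Only "b" qualifies here, so the guard returns the factor untouched
levels(lump_prop_sketch(var1, var2, prop = 0.4))

# Both levels qualify here, so everything is lumped together
levels(lump_prop_sketch(var1, var2, prop = 0.75))
```

With only two levels there is no prop value in this sketch that lumps b alone, which matches what the question observed.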
Edit
There is also an issue on GitHub, because the man page says that levels are lumped if they appear "fewer than" prop times, but they are also lumped if they appear "exactly" prop times.
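The difference between "fewer than" and "exactly" only shows up at the boundary. A small base-R sketch, using made-up data chosen so that the shares are exact in floating point (the names `f`, `share`, `strictly_below` and `at_or_below` are illustrative only):

```r
# Made-up data: "x" and "y" appear once each, "z" twice
f <- c("x", "y", "z", "z")
share <- as.vector(table(f)) / length(f)  # 0.25, 0.25, 0.5
prop <- 0.25

# "Fewer than" prop, as the man page wording suggests
strictly_below <- share < prop

# "At or below" prop, which is what the issue reports actually happens
at_or_below <- share <= prop

strictly_below  # FALSE FALSE FALSE
at_or_below     # TRUE TRUE FALSE
```

So with prop set exactly to a level's share, the strict reading would lump nothing, while the inclusive comparison marks x and y as candidates.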