forcats::fct_lump_prop() behavior

Question

68 Views Asked by Michaël Weber At 14 November 2022 at 20:41

I have been struggling with the {forcats} fct_lump_prop() function, more specifically with its w = and prop = arguments. What exactly are they supposed to represent in the example below?

df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
             var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))

# A tibble: 6 x 2
  var1   var2
  <chr> <dbl>
1 a       0.2
2 a       0.3
3 a       0.2
4 b       0.1
5 b       0.1
6 b       0.1

What prop value should I set so that a is kept intact but b is lumped as Others?

# A tibble: 6 x 3
  var1   var2  fct
  <chr> <dbl> <fct>
1 a       0.2   a
2 a       0.3   a
3 a       0.2   a
4 b       0.1   Others
5 b       0.1   Others
6 b       0.1   Others

I have tried empirically but can't work out how this behaves. Intuitively, I would say that setting a prop value above 0.3 (the sum of the b weights) and below 0.7 (the sum of the a weights) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, and then everything becomes an Other level...

df %>%
  mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))

# A tibble: 6 x 3
  var1   var2 fct  
  <fct> <dbl> <fct>
1 a       0.2 Other
2 a       0.3 Other
3 a       0.2 Other
4 b       0.1 Other
5 b       0.1 Other
6 b       0.1 Other

So what am I not understanding? The few examples I found elsewhere did not help me grasp how prop and w behave.

Thanks a lot.

**Ric** · Accepted Answer · 2022-11-14T21:36:34.217000

The relevant explanation is in these lines of the source of fct_lump_prop():

if (prop > 0 && sum(prop_n <= prop) <= 1) {
   return(f) # No lumping needed
}

That is, if only one level would end up in "Other", the function does nothing and returns the factor unchanged.
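To see how this check plays out on the data from the question, here is a minimal base-R sketch of the rule. The helper name lump_prop_sketch() is hypothetical and this is not the forcats implementation; it only mirrors the quoted condition for positive prop values and does not move "Other" to the end of the levels as forcats does.

```r
# Hypothetical helper: a base-R sketch of the rule quoted above,
# not the forcats implementation itself (positive prop only).
lump_prop_sketch <- function(f, w, prop, other_level = "Other") {
  f <- factor(f)
  # Weighted share of each level: sum of its weights over the total weight
  prop_n <- tapply(w, f, sum) / sum(w)
  to_lump <- as.vector(prop_n <= prop)
  # Mirror of the quoted check: with at most one candidate level, do nothing
  if (sum(to_lump) <= 1) {
    return(f)
  }
  # Assigning the same name to several levels merges them
  levels(f) <- ifelse(to_lump, other_level, levels(f))
  f
}

var1 <- c(rep("a", 3), rep("b", 3))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

# Weighted shares are roughly 0.7 for a and 0.3 for b
round(tapply(var2, var1, sum) / sum(var2), 2)

# Only b falls at or below 0.5, so a single level would be lumped: unchanged
unchanged <- lump_prop_sketch(var1, var2, prop = 0.5)
levels(unchanged)

# Both a and b fall at or below 0.8, so both are lumped
all_other <- lump_prop_sketch(var1, var2, prop = 0.8)
levels(all_other)
```

With only two levels there is no prop value that lumps b alone: either one level qualifies (and nothing happens) or both do.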

library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3),rep("c", 3)),
             var2 = (1:9)/(45))

df %>% group_by(var1) %>% summarise(sum(var2))
#> # A tibble: 3 × 2
#>   var1  `sum(var2)`
#>   <chr>       <dbl>
#> 1 a           0.133
#> 2 b           0.333
#> 3 c           0.533

df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.45))
#> # A tibble: 9 × 3
#>   var1    var2 fct  
#>   <chr>  <dbl> <fct>
#> 1 a     0.0222 Other
#> 2 a     0.0444 Other
#> 3 a     0.0667 Other
#> 4 b     0.0889 Other
#> 5 b     0.111  Other
#> 6 b     0.133  Other
#> 7 c     0.156  c    
#> 8 c     0.178  c    
#> 9 c     0.2    c

df %>%
  mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.25))
#> # A tibble: 9 × 3
#>   var1    var2 fct  
#>   <chr>  <dbl> <fct>
#> 1 a     0.0222 a    
#> 2 a     0.0444 a    
#> 3 a     0.0667 a    
#> 4 b     0.0889 b    
#> 5 b     0.111  b    
#> 6 b     0.133  b    
#> 7 c     0.156  c    
#> 8 c     0.178  c    
#> 9 c     0.2    c

Created on 2022-11-14 with reprex v2.0.2

This doesn't appear to be documented anywhere except in the source code itself.
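For the two-level data in the question, one possible workaround is to skip proportion-based lumping and name the level to keep explicitly. The sketch below assumes you already know that a is the level to keep, and uses fct_other() from forcats with its keep and other_level arguments.

```r
library(forcats)

var1 <- c(rep("a", 3), rep("b", 3))

# Keep "a" and send every other level to "Others", regardless of proportions
fct <- fct_other(factor(var1), keep = "a", other_level = "Others")

levels(fct)
as.character(fct)
```

This gives up the weighting by var2 entirely, so it only fits cases where the levels to keep are known in advance.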

Edit

There is also an issue on GitHub because the man page says that levels are lumped if they appear "fewer than" prop * n times, but a level is also lumped if it appears "exactly" prop * n times.
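A small unweighted example can illustrate the boundary case. This is a sketch based on the `prop_n <= prop` comparison in the source excerpt above; the factor f is made up for illustration.

```r
library(forcats)

# Counts: x = 2, y = 1, z = 1, so the shares are 0.5, 0.25 and 0.25
f <- factor(c("x", "x", "y", "z"))

# y and z sit exactly at prop = 0.25 and are still lumped
at_threshold <- fct_lump_prop(f, prop = 0.25)
levels(at_threshold)

# With prop just below the shares of y and z, nothing qualifies
below_threshold <- fct_lump_prop(f, prop = 0.2)
levels(below_threshold)
```

Here two levels reach the threshold together, so the "only one level in Other" check does not get in the way.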
