Alternatives for distinct(.keep_all = TRUE) in arrow?


I have a larger-than-memory arrow dataset, created with open_dataset() from partitioned parquet files, on which I need to run distinct(.keep_all = TRUE). The computation has to stay on disk, so I'm using arrow to speed things up without crashing my R session.

I want to keep rows that have distinct values in columns a and b.

Example

library(dplyr)
library(arrow)

# Small in-memory stand-in for the real on-disk dataset
df <- tibble(a = c(1,1,2,2),
             b = c(1,1,2,1),
             c = c("x", "y", "z", "a")) %>%
  arrow_table()


df %>% 
  distinct(a, b, .keep_all = TRUE)

This results in: Error: distinct() with .keep_all = TRUE not supported in Arrow

Desired Output

An arrow dataset with the following values.

  a     b c    
  <dbl> <dbl> <chr>
1     1     1 x    
2     2     2 z    
3     2     1 a  

I see others have had similar questions, but it doesn't seem like arrow plans to support .keep_all (see closed issue).

base::duplicated() would also work but it's not supported by arrow either. Any thoughts on how to work around this without using collect() (which crashes my R session)? TIA!

Answer by mhovd

As commented, a grouped slice operation will likely do instead.

library(dplyr)

df <- tibble(a = c(1,1,2,2),
             b = c(1,1,2,1),
             c = c("x", "y", "z", "a"))

df %>% 
  group_by(a, b) %>% 
  slice_head(n = 1)

#> # A tibble: 3 × 3
#> # Groups:   a, b [3]
#>       a     b c    
#>   <dbl> <dbl> <chr>
#> 1     1     1 x    
#> 2     2     1 a    
#> 3     2     2 z

Created on 2024-03-28 with reprex v2.1.0
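
Note that the example above runs on an in-memory tibble. As far as I know, arrow does not support slice_head() on grouped data, so the same pipeline may fail on an arrow table or dataset. One possible workaround, offered as a sketch rather than a tested solution for the real data, is to hand the query to DuckDB with to_duckdb(), keep the first row per group with a row_number() window filter, and return to arrow with to_arrow(). This assumes the duckdb and dbplyr packages are installed; the small table below stands in for the dataset returned by open_dataset().

```r
library(dplyr)
library(arrow)
library(duckdb)  # assumption: duckdb (and dbplyr) are installed

# Small stand-in for the on-disk dataset from open_dataset()
df <- tibble(a = c(1, 1, 2, 2),
             b = c(1, 1, 2, 1),
             c = c("x", "y", "z", "a")) %>%
  arrow_table()

result <- df %>%
  to_duckdb() %>%                  # expose the arrow data to DuckDB as a lazy table
  group_by(a, b) %>%
  filter(row_number() == 1L) %>%   # keep one row per (a, b) group
  ungroup() %>%
  to_arrow()                       # back to arrow as a RecordBatchReader

# No ordering is specified, so which row survives in each group is arbitrary.
# Add dbplyr::window_order() before the filter if a specific row is needed.
```

The result is a RecordBatchReader rather than a data frame, so nothing is collected into R at this point. It should be possible to pass it to write_dataset() to keep the output on disk, though whether DuckDB stays within memory limits on the full data is something to check on the real dataset.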