Checking whether sets of columns are the same, row wise in R, in any order

Question

Checking whether sets of columns are the same, row wise in R, in any order

1k Views Asked by fe108 At 08 January 2024 at 13:35

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c("a", "b", "d", "a"),
  col3 = rep("a", 4L),
  col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"),
  col6 = rep(c("b", "a"), each = 2L)
)

col1	col2	col3	col4	col5	col6
a	a	a	a	a	b
b	b	a	b	a	b
c	d	a	d	c	a
d	a	a	a	d	a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1	col2	col3	col4	col5	col6	assessment
a	a	a	a	a	b	FALSE (because 1-3 are not same as 4-6)
b	b	a	b	a	b	TRUE (because 1-3 are the same as 4-6, if ignore order)
c	d	a	d	c	a	TRUE (because 1-3 are the same as 4-6, if ignore order)
d	a	a	a	d	a	TRUE (because 1-3 are the same as 4-6, if ignore order)

Original Q&A

There are 8 best solutions below

jpsmith On 08 January 2024 at 13:46

In base R you could use vapply and sort:

df$assessment <- vapply(seq_len(nrow(df)), \(x) 
                        all(sort(unlist(df[x,1:3])) == sort(unlist(df[x,4:6]))), logical(1))

Output:

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

s_baldur On 08 January 2024 at 13:47

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

Reproducible data:

df <- data.frame(
  col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"),
  col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a")
)

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE)))
bench::mark(
  apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))),
  apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600])))
)
#   expression                                                   min median `itr/sec` mem_alloc 
#   <bch:expr>                                                 <bch> <bch:>     <dbl> <bch:byt>   
# 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s  1.68s     0.594     333MB   
# 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s  9.01s     0.111     191MB

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a major speedup.

tmfmnk On 08 January 2024 at 13:48

A dplyr and vecsets option could be:

df %>%
 rowwise() %>%
 mutate(cond = vsetequal(c_across(col1:col3), c_across(col4:col6), multiple = TRUE))

  col1  col2  col3  col4  col5  col6  cond 
  <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 a     a     a     a     a     b     FALSE
2 b     b     a     b     a     b     TRUE 
3 c     d     a     d     c     a     TRUE 
4 d     a     a     a     d     a     TRUE

The same idea with purrr::pmap():

df %>%
 mutate(cond = pmap_lgl(across(col1:col6), 
                        ~ vsetequal(c(...)[1:3], c(...)[4:6], multiple = TRUE)))

This is an unreasonably inefficient solution, but out of curiosity:

df %>%
 rowwise() %>%
 mutate(cond = toString(sort(c_across(col1:col3))) == toString(sort(c_across(col4:col6))))

The same idea with purrr:pmap():

df %>%
 mutate(cond = pmap_lgl(across(col1:col6), 
                        ~ toString(sort(c(...)[1:3])) == toString(sort(c(...)[4:6]))))

Using the transpose logic from @SamR with vecsets:

df %>%
 mutate(cond = map2_lgl(.x = across(col1:col3) %>% t() %>% data.frame(), 
                        .y = across(col4:col6) %>% t() %>% data.frame(), 
                        vsetequal))

The same approach using data.table::transpose():

df %>%
 mutate(cond = map2_lgl(.x = data.table::transpose(across(col1:col3)), 
                        .y = data.table::transpose(across(col4:col6)), 
                        vsetequal))

SamR On 08 January 2024 at 13:59

Avoid iterating over rows

Generally iterating over rows is slow, and I have found dplyr::rowwise() approaches become very slow with more than a few thousand rows. It tends to be faster to use purrr::pmap() and much faster to iterate over columns.

base R approach

You could take the transpose of the relevant columns and iterate over the columns of that.

cols <- paste0("col", 1:6)

df$assessment <- df[cols] |>
    t() |>
    data.frame() |>
    sapply(\(x) all(sort(x[1:3]) == sort(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

`tidyverse` approach: convert from wide to long

Alternatively if you want to remain in the tidyverse you could convert from wide to long:

df %>%
    mutate(
        assessment = . |>
            mutate(id = row_number()) |>
            tidyr::pivot_longer(
                -id,
                names_to = "col",
                names_transform = readr::parse_number
            ) |>
            group_by(id) |>
            summarise(
                assessment = all(
                    sort(
                        value[col %in% 1:3]
                    ) ==
                        sort(
                            value[col %in% 4:6]
                        )
                )
            ) |>
            pull(
                assessment
            )
    )

This is more verbose but I suspect it will be considerably faster with any reasonably-sized dataset.

lotus On 08 January 2024 at 14:25

As has been noted you should avoid rowwise operations. Here's an alternative that compares sets using a helper function that efficiently sorts by row so that comparisons are completely vectorized.

library(dplyr)

f <- function(set1, set2) {
  s1 <- as.matrix(pick({{set1}}))
  s2 <- as.matrix(pick({{set2}}))
  row_sort <- function(m) matrix(m[order(row(m), m)], ncol = ncol(m), byrow = TRUE)
  !rowSums(row_sort(s1) != row_sort(s2)) > 0
}

dat %>%
  mutate(assessment = f(col1:col3, col4:col6))

  col1 col2 col3 col4 col5 col6 assessment
1    a    a    a    a    a    b      FALSE
2    b    b    a    b    a    b       TRUE
3    c    d    a    d    c    a       TRUE
4    d    a    a    a    d    a       TRUE

ThomasIsCoding On 08 January 2024 at 14:33

Try the code below with split.default + colMeans

df$assessment <-
    colMeans(
        do.call(
            `==`,
            lapply(split.default(
                df,
                grepl("[1-3]$", names(df))
            ), \(d) apply(d, 1, sort))
        )
    ) == 1

which should give

> df
  col1 col2 col3 col4 col5 col6 assessment
1    a    a    a    a    a    b      FALSE
2    b    b    a    b    a    b       TRUE
3    c    d    a    d    c    a       TRUE
4    d    a    a    a    d    a       TRUE

TarJae On 08 January 2024 at 15:41

This one is quite verbose, but I can't resist. Here is one with pivoting:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with("col"), names_to = "col_set") %>%
  group_by(group = (row_number() - 1) %/% ncol(df) + 1) %>% 
  mutate(x = lead(value, 3)) %>% 
  na.omit() %>% 
  mutate(across(c(value, x), ~sort(.))) %>% 
  summarize(check = all(value == x), .groups = "drop") %>% 
  bind_cols(df) %>% 
  select(-group)

# A tibble: 4 × 7
  check col1  col2  col3  col4  col5  col6 
  <lgl> <chr> <chr> <chr> <chr> <chr> <chr>
1 FALSE a     a     a     a     a     b    
2 TRUE  b     b     a     b     a     b    
3 TRUE  c     d     a     d     c     a    
4 TRUE  d     a     a     a     d     a

**Isaac** · Accepted Answer · 2024-01-08T13:42:55.207000

Using dplyr you can do the following:

df %>%
  rowwise() %>%
  mutate(result = all(sort(c_across(col1:col3)) == sort(c_across(col4:col6))))

# A tibble: 4 × 7
# Rowwise: 
  col1  col2  col3  col4  col5  col6  result
  <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 
1 a     a     a     a     a     b     FALSE 
2 b     b     a     b     a     b     TRUE  
3 c     d     a     d     c     a     TRUE  
4 d     a     a     a     d     a     TRUE

Checking whether sets of columns are the same, row wise in R, in any order

There are 8 best solutions below

Avoid iterating over rows

base R approach

`tidyverse` approach: convert from wide to long

Related Questions in R

Related Questions in DPLYR

Related Questions in ROWWISE

Trending Questions

Popular # Hahtags

Popular Questions

Checking whether sets of columns are the same, row wise in R, in any order

There are 8 best solutions below

Avoid iterating over rows

base R approach

tidyverse approach: convert from wide to long

Related Questions in R

Related Questions in DPLYR

Related Questions in ROWWISE

Trending Questions

Popular # Hahtags

Popular Questions

`tidyverse` approach: convert from wide to long