Checking whether sets of columns are the same, row wise in R, in any order

1k Views Asked by At

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame(
  col1 = c("a", "b", "c", "d"),
  col2 = c("a", "b", "d", "a"),
  col3 = rep("a", 4L),
  col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"),
  col6 = rep(c("b", "a"), each = 2L)
)
col1 col2 col3 col4 col5 col6
a a a a a b
b b a b a b
c d a d c a
d a a a d a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1 col2 col3 col4 col5 col6 assessment
a a a a a b FALSE (because 1-3 are not same as 4-6)
b b a b a b TRUE (because 1-3 are the same as 4-6, if ignore order)
c d a d c a TRUE (because 1-3 are the same as 4-6, if ignore order)
d a a a d a TRUE (because 1-3 are the same as 4-6, if ignore order)
8

There are 8 best solutions below

1
Isaac On BEST ANSWER

Using dplyr you can do the following:

df %>%
  rowwise() %>%
  mutate(result = all(sort(c_across(col1:col3)) == sort(c_across(col4:col6))))

# A tibble: 4 × 7
# Rowwise: 
  col1  col2  col3  col4  col5  col6  result
  <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 
1 a     a     a     a     a     b     FALSE 
2 b     b     a     b     a     b     TRUE  
3 c     d     a     d     c     a     TRUE  
4 d     a     a     a     d     a     TRUE  
0
jpsmith On

In base R you could use vapply and sort:

df$assessment <- vapply(seq_len(nrow(df)), \(x) 
                        all(sort(unlist(df[x,1:3])) == sort(unlist(df[x,4:6]))), logical(1))

Output:

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE
1
s_baldur On

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

Reproducible data:

df <- data.frame(
  col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"),
  col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"),
  col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a")
)

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE)))
bench::mark(
  apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))),
  apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600])))
)
#   expression                                                   min median `itr/sec` mem_alloc 
#   <bch:expr>                                                 <bch> <bch:>     <dbl> <bch:byt>   
# 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s  1.68s     0.594     333MB   
# 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s  9.01s     0.111     191MB 

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a major speedup.

1
tmfmnk On

A dplyr and vecsets option could be:

df %>%
 rowwise() %>%
 mutate(cond = vsetequal(c_across(col1:col3), c_across(col4:col6), multiple = TRUE))

  col1  col2  col3  col4  col5  col6  cond 
  <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 a     a     a     a     a     b     FALSE
2 b     b     a     b     a     b     TRUE 
3 c     d     a     d     c     a     TRUE 
4 d     a     a     a     d     a     TRUE

The same idea with purrr::pmap():

df %>%
 mutate(cond = pmap_lgl(across(col1:col6), 
                        ~ vsetequal(c(...)[1:3], c(...)[4:6], multiple = TRUE)))

This is an unreasonably inefficient solution, but out of curiosity:

df %>%
 rowwise() %>%
 mutate(cond = toString(sort(c_across(col1:col3))) == toString(sort(c_across(col4:col6))))

The same idea with purrr:pmap():

df %>%
 mutate(cond = pmap_lgl(across(col1:col6), 
                        ~ toString(sort(c(...)[1:3])) == toString(sort(c(...)[4:6]))))

Using the transpose logic from @SamR with vecsets:

df %>%
 mutate(cond = map2_lgl(.x = across(col1:col3) %>% t() %>% data.frame(), 
                        .y = across(col4:col6) %>% t() %>% data.frame(), 
                        vsetequal))

The same approach using data.table::transpose():

df %>%
 mutate(cond = map2_lgl(.x = data.table::transpose(across(col1:col3)), 
                        .y = data.table::transpose(across(col4:col6)), 
                        vsetequal))
0
SamR On

Avoid iterating over rows

Generally iterating over rows is slow, and I have found dplyr::rowwise() approaches become very slow with more than a few thousand rows. It tends to be faster to use purrr::pmap() and much faster to iterate over columns.

base R approach

You could take the transpose of the relevant columns and iterate over the columns of that.

cols <- paste0("col", 1:6)

df$assessment <- df[cols] |>
    t() |>
    data.frame() |>
    sapply(\(x) all(sort(x[1:3]) == sort(x[4:6])))

#   col1 col2 col3 col4 col5 col6 assessment
# 1    a    a    a    a    a    b      FALSE
# 2    b    b    a    b    a    b       TRUE
# 3    c    d    a    d    c    a       TRUE
# 4    d    a    a    a    d    a       TRUE

tidyverse approach: convert from wide to long

Alternatively if you want to remain in the tidyverse you could convert from wide to long:

df %>%
    mutate(
        assessment = . |>
            mutate(id = row_number()) |>
            tidyr::pivot_longer(
                -id,
                names_to = "col",
                names_transform = readr::parse_number
            ) |>
            group_by(id) |>
            summarise(
                assessment = all(
                    sort(
                        value[col %in% 1:3]
                    ) ==
                        sort(
                            value[col %in% 4:6]
                        )
                )
            ) |>
            pull(
                assessment
            )
    )

This is more verbose but I suspect it will be considerably faster with any reasonably-sized dataset.

2
lotus On

As has been noted you should avoid rowwise operations. Here's an alternative that compares sets using a helper function that efficiently sorts by row so that comparisons are completely vectorized.

library(dplyr)

f <- function(set1, set2) {
  s1 <- as.matrix(pick({{set1}}))
  s2 <- as.matrix(pick({{set2}}))
  row_sort <- function(m) matrix(m[order(row(m), m)], ncol = ncol(m), byrow = TRUE)
  !rowSums(row_sort(s1) != row_sort(s2)) > 0
}

dat %>%
  mutate(assessment = f(col1:col3, col4:col6))

  col1 col2 col3 col4 col5 col6 assessment
1    a    a    a    a    a    b      FALSE
2    b    b    a    b    a    b       TRUE
3    c    d    a    d    c    a       TRUE
4    d    a    a    a    d    a       TRUE
0
ThomasIsCoding On

Try the code below with split.default + colMeans

df$assessment <-
    colMeans(
        do.call(
            `==`,
            lapply(split.default(
                df,
                grepl("[1-3]$", names(df))
            ), \(d) apply(d, 1, sort))
        )
    ) == 1

which should give

> df
  col1 col2 col3 col4 col5 col6 assessment
1    a    a    a    a    a    b      FALSE
2    b    b    a    b    a    b       TRUE
3    c    d    a    d    c    a       TRUE
4    d    a    a    a    d    a       TRUE
0
TarJae On

This one is quite verbose, but I can't resist. Here is one with pivoting:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with("col"), names_to = "col_set") %>%
  group_by(group = (row_number() - 1) %/% ncol(df) + 1) %>% 
  mutate(x = lead(value, 3)) %>% 
  na.omit() %>% 
  mutate(across(c(value, x), ~sort(.))) %>% 
  summarize(check = all(value == x), .groups = "drop") %>% 
  bind_cols(df) %>% 
  select(-group)

# A tibble: 4 × 7
  check col1  col2  col3  col4  col5  col6 
  <lgl> <chr> <chr> <chr> <chr> <chr> <chr>
1 FALSE a     a     a     a     a     b    
2 TRUE  b     b     a     b     a     b    
3 TRUE  c     d     a     d     c     a    
4 TRUE  d     a     a     a     d     a