R tapply() does not work on data.frame due to improper length check


This is a bug report, not a question. The procedure for reporting bugs in R core appears complicated, and I don't want to join a mailing list, so I'm posting this here, as recommended by https://www.r-project.org/bugs.html.

Here it is:

The tapply() help of R 4.0.3 says the following on argument X:

an R object for which a split method exists. Typically vector-like, allowing subsetting with [.

Issue: this R object cannot be a data.frame, although a data.frame can be split and subsetted.

To reproduce, run the following:

func <- function(dt) {
    sum(dt[,1] * dt[,2])
}

tab <- data.frame(x = sample(100), y = sample(100), z = sample(letters[1:10], 100, replace = TRUE))

tapply(tab[,1:2], INDEX = tab$z, FUN = func)

This results in

Error in tapply(tab[, 1:2], INDEX = tab$z, FUN = func) : arguments must have same length

which, upon looking at the tapply() source code, appears to result from this check:

if (!all(lengths(INDEX) == length(X)))
    stop("arguments must have same length")

But length() is not the relevant function to call on a data.frame to determine if it has the right dimension for a split. nrow() should be used instead.
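The mismatch can be seen directly. This is a minimal sketch assuming a data frame shaped like the one in the reproduction above. The names df and idx and the seed are illustrative, not part of the original report.

```r
# A data frame is a list of columns, so length() counts columns, not rows.
set.seed(1)  # arbitrary seed, only for reproducibility
df  <- data.frame(x = sample(100), y = sample(100))
idx <- sample(letters[1:10], 100, replace = TRUE)

length(df)   # 2: number of columns
nrow(df)     # 100: number of rows
length(idx)  # 100: the value tapply() compares against length(X)

# The comparison tapply() performs fails for a data frame ...
check_length <- all(lengths(list(idx)) == length(df))  # FALSE
# ... while a row-based comparison would pass.
check_nrow   <- all(lengths(list(idx)) == nrow(df))    # TRUE
```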

Replacing the above code with

if (is.data.frame(X)) {
    len <- nrow(X)
} else {
    len <- length(X)
}
if (!all(lengths(INDEX) == len))
    stop("arguments must have same length")

solves the error.
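Without patching tapply(), the same per-group result can be obtained in base R with split() and sapply(). The sketch below reuses the question's setup. The seed and the names groups and res are added assumptions, and this illustrates the idea rather than how R core implements tapply().

```r
# Same function as in the reproduction: sum of row-wise products of two columns.
func <- function(dt) {
    sum(dt[, 1] * dt[, 2])
}

set.seed(42)  # arbitrary seed, only for reproducibility
tab <- data.frame(x = sample(100), y = sample(100),
                  z = sample(letters[1:10], 100, replace = TRUE))

# split() has a data.frame method that splits by rows,
# yielding one two-column data frame per level of z.
groups <- split(tab[, 1:2], tab$z)

# Apply func to each piece; the result is a vector named by group.
res <- sapply(groups, func)
res
```

Because the groups partition the rows, the per-group values add up to sum(tab$x * tab$y), which is a quick sanity check on the result.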

This fix looks straightforward, and implementing it would make tapply() considerably more useful (I know there are powerful alternatives to tapply()), so I wonder whether the current limitation reflects a design choice.

There is 1 answer below.

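With the function defined in the question, we could use dplyr:

library(dplyr)
# cur_data() passes the current group's columns (without the grouping column z) to func.
# In dplyr >= 1.1.0, cur_data() is deprecated in favour of pick().
tab %>%
    group_by(z) %>%
    summarise(new = func(cur_data()), .groups = 'drop')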

Output:

# A tibble: 10 x 2
#   z       new
#   <chr> <int>
# 1 a     26647
# 2 b     28010
# 3 c     31340
# 4 d     20780
# 5 e     33311
# 6 f     31880
# 7 g     37527
# 8 h      8752
# 9 i     15490

Or use by() from base R, which splits a data.frame by rows and applies the function to each piece:

by(tab[, 1:2], tab$z, FUN = func)

According to ?tapply

X - an R object for which a split method exists. Typically vector-like, allowing subsetting with [.

Here, tab[, 1:2] is a data.frame, not a vector. A data.frame is a list of columns, so length() returns its number of columns rather than its number of rows. A matrix, by contrast, is a vector with a dim attribute.
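The difference between the two structures can be checked with a small example. The objects df and m below are illustrative and not taken from the question.

```r
df <- data.frame(x = 1:4, y = 5:8)
m  <- as.matrix(df)

is.list(df)    # TRUE: a data frame is a list of columns
length(df)     # 2: number of columns

is.atomic(m)   # TRUE: a matrix is an atomic vector ...
dim(m)         # 4 2: ... carrying a dim attribute
length(m)      # 8: total number of cells (rows * columns)
```

Note that length(m) is the total number of cells, so an index with one entry per row would not match it either.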