Expected data format for the rstatix package in R?


I'm struggling and VERY frustrated with this package. I don't see any explanation or description for what the expected input data structures are, and every time I use any rstatix functions, they break because my data isn't in the "right" format. The examples given in the package are not helping, because when I imitate them, the functions still break. I'm very close to just dumping the package because it is so non-user friendly.

For example, I can create ultra-simple pre and post columns of data on which I want to run a paired t-test.

df <- data.frame( cbind( as.numeric(rep(1:5)), as.numeric(rep(2:6)) ) )
colnames( df ) <- c( "pre", "post" )
df
  pre post
1   1    2
2   2    3
3   3    4
4   4    5
5   5    6

Now if I try the t-test:

> t_test( df, formula = pre~post, paired=T )
Error in t.test.default(x = 1L, y = 2L, paired = TRUE, var.equal = FALSE,  : 
  not enough 'x' observations

> t_test( df, formula = post~pre, paired=T )
Error in t.test.default(x = 2, y = 3, paired = TRUE, var.equal = FALSE,  : 
  not enough 'x' observations

I don't understand the difference between this data frame and the ToothGrowth example they give in the rstatix package that works perfectly:

> head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5


> t_test( ToothGrowth, formula = len~dose, paired=T )
# A tibble: 3 x 10
  .y.   group1 group2    n1    n2 statistic    df        p         p.adj p.adj.signif
* <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>         <dbl> <chr>       
1 len   0.5    1         20    20     -6.97    19 1.23e- 6 0.00000246    ****        
2 len   0.5    2         20    20    -11.3     19 7.19e-10 0.00000000216 ****        
3 len   1      2         20    20     -4.60    19 1.93e- 4 0.000193      ***         

I've also tried formats of this type, which don't work either:

df1 <- cbind( rep("pre",times=5), as.numeric(seq(1,5,by=1)))
df2 <- cbind( rep("post",times=5), as.numeric(seq(2,6,by=1)))
df <- data.frame( rbind( df1, df2 ) )
> df
     X1 X2
1   pre  1
2   pre  2
3   pre  3
4   pre  4
5   pre  5
6  post  2
7  post  3
8  post  4
9  post  5
10 post  6

Can somebody please point me to a simple, straightforward description of what exactly rstatix expects as input to its functions?

Answer by Allan Cameron:

According to the help file obtained by typing ?rstatix::t_test into the console, the first two arguments to the function are:

data: a data.frame containing the variables in the formula.

formula: a formula of the form x ~ group where x is a numeric variable giving the data values and group is a factor with one or multiple levels giving the corresponding groups. For example, formula = TP53 ~ cancer_group.

From reading this, it seems clear that the numerical values you want to compare should all be in a single column, and the grouping variable should be in a different column. This is the structure that the data frame you included at the bottom of your question has, and in data science circles this would be described as a "long-format" data frame.

When specifying the formula, we need to make sure that the numeric column is on the left hand side of the formula, and the grouping variable is on the right hand side, as described in the docs above.
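A quick way to see why the two attempts in the question fail is to look at what the formula ends up treating as groups. The sketch below uses only base R. The names df_wide, group_sizes, dose_sizes and df_chr are mine, and the construction of df_chr assumes the second data frame was built by stacking the two cbind() results, which is what the printed output in the question suggests.

```r
# Wide data from the question: with `pre ~ post`, every distinct value of
# `post` acts as its own group, so each group holds a single observation.
df_wide <- data.frame(pre = 1:5, post = 2:6)
group_sizes <- lengths(split(df_wide$pre, df_wide$post))
group_sizes

# ToothGrowth, by contrast, has 20 observations in each `dose` group.
dose_sizes <- table(ToothGrowth$dose)
dose_sizes

# Assumption: the second attempt stacked two cbind() results. cbind() on a
# character vector and a numeric vector gives a character matrix, so X2 is
# not numeric (character, or factor in R versions before 4.0).
df_chr <- data.frame(rbind(cbind(rep("pre", 5), 1:5),
                           cbind(rep("post", 5), 2:6)))
class(df_chr$X2)

# Converting the column gives the formula a numeric left-hand side.
df_chr$X2 <- as.numeric(as.character(df_chr$X2))
```

A group of size one is not enough for a t-test, which would be consistent with the "not enough 'x' observations" error in the question.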

If we take your own example:

df <- data.frame(X1 = rep(c('pre', 'post'), each = 5), X2 = c(1:5, 2:6))

df
#>      X1 X2
#> 1   pre  1
#> 2   pre  2
#> 3   pre  3
#> 4   pre  4
#> 5   pre  5
#> 6  post  2
#> 7  post  3
#> 8  post  4
#> 9  post  5
#> 10 post  6

Then we can call t_test with X2 on the left hand side of the formula (since it contains the numeric values we wish to compare), and X1 on the right (since it contains the grouping variable pre/post):

rstatix::t_test(X2 ~ X1, data = df)
#> # A tibble: 1 x 8
#>   .y.   group1 group2    n1    n2 statistic    df     p
#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
#> 1 X2    post   pre        5     5         1     8 0.347

This seems like a fairly standard way to structure data and call a formula-based statistical test in R. It matches how we might use lm, aov, glm, t.test, model.frame, not to mention functions in several important extension packages.
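As a point of comparison, base R's t.test() accepts the same formula and the same long-format data frame. A minimal sketch (res is just an illustrative name); by default it runs Welch's unequal-variance test, so its statistic and degrees of freedom should agree with the rstatix output above:

```r
# Same long-format data as in the rstatix call above.
df <- data.frame(X1 = rep(c("pre", "post"), each = 5), X2 = c(1:5, 2:6))

# Numeric values on the left of the formula, grouping variable on the right.
res <- stats::t.test(X2 ~ X1, data = df)

res$statistic  # t statistic
res$parameter  # degrees of freedom
res$p.value    # two-sided p-value
```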

However, we often start with our numeric variables in different columns (as in the data frame at the top of your question, which we would describe as "wide format"). If so, we would need to reshape our data to get such functions to work as expected.

One option would be to pivot into long format. For example, if we had:

df <- data.frame(A = 1:5, B = 2:6, C = 3:7)

df
#>   A B C
#> 1 1 2 3
#> 2 2 3 4
#> 3 3 4 5
#> 4 4 5 6
#> 5 5 6 7

Then we could pivot into long format like this:

df_long <- tidyr::pivot_longer(df, tidyr::everything(), names_to = 'treatment')

df_long
#> # A tibble: 15 x 2
#>    treatment value
#>    <chr>     <int>
#>  1 A             1
#>  2 B             2
#>  3 C             3
#>  4 A             2
#>  5 B             3
#>  6 C             4
#>  7 A             3
#>  8 B             4
#>  9 C             5
#> 10 A             4
#> 11 B             5
#> 12 C             6
#> 13 A             5
#> 14 B             6
#> 15 C             7

And we would get the expected result by putting the value column on the left of the formula, and the treatment column on the right.

rstatix::t_test(value ~ treatment, data = df_long)
#> # A tibble: 3 x 10
#>   .y.   group1 group2    n1    n2 statistic    df     p p.adj p.adj.signif
#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <dbl> <chr>       
#> 1 value A      B          5     5        -1     8 0.347 0.694 ns          
#> 2 value A      C          5     5        -2     8 0.08  0.242 ns          
#> 3 value B      C          5     5        -1     8 0.347 0.694 ns
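The same idea applies to the pre/post data frame from the question. The sketch below reshapes it with base R's stack() instead of tidyr::pivot_longer(), purely so that it runs without extra packages; the names df_wide, score and time are my own choices, and the rstatix call is only attempted if the package is installed.

```r
# Wide pre/post data as in the question.
df_wide <- data.frame(pre = 1:5, post = 2:6)

# stack() returns a long data frame with columns `values` and `ind`.
df_long <- stack(df_wide)
names(df_long) <- c("score", "time")  # illustrative column names
df_long

# Numeric column on the left, grouping column on the right.
if (requireNamespace("rstatix", quietly = TRUE)) {
  print(rstatix::t_test(score ~ time, data = df_long))
}
```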

Important!

Note that a paired t-test can't be run on the pre/post examples you have given. A paired t-test subtracts one column from the other and tests whether the mean of the resulting vector of differences is statistically different from 0. Since there is a fixed difference of 1 between the two variables, t.test (which rstatix::t_test uses under the hood) ends up calculating the standard error of the vector c(1, 1, 1, 1, 1), which is 0. The t statistic divides by that standard error, so no significance test is possible and an error is thrown. This should not be a problem with real data.
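The zero-variance problem can be reproduced with base R's t.test() alone. In this sketch the values in post_varied are made up for illustration, and tryCatch() is used only to capture the error message rather than halt the script:

```r
pre <- c(1, 2, 3, 4, 5)

# Every difference is exactly 1, so the differences have zero spread.
post_const <- pre + 1
sd(post_const - pre)

# The paired test cannot be computed; keep the error text instead of stopping.
const_result <- tryCatch(
  t.test(post_const, pre, paired = TRUE),
  error = function(e) conditionMessage(e)
)
const_result

# Made-up post values whose differences vary, so the test can be computed.
post_varied <- c(2.3, 2.9, 4.4, 4.8, 6.1)
varied_result <- t.test(post_varied, pre, paired = TRUE)
varied_result$p.value
```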