I'm struggling and VERY frustrated with this package. I don't see any explanation or description of what the expected input data structures are, and every time I use any rstatix function, it breaks because my data isn't in the "right" format. The examples given in the package are not helping, because when I imitate them, the functions still break. I'm very close to just dumping the package because it is so user-unfriendly.
For example, I can create ultra-simple pre and post columns of data that I want to run a paired t-test on.
df <- data.frame( cbind( as.numeric(rep(1:5)), as.numeric(rep(2:6)) ) )
colnames( df ) <- c( "pre", "post" )
df
pre post
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
Now if I try the t-test:
> t_test( df, formula = pre~post, paired=T )
Error in t.test.default(x = 1L, y = 2L, paired = TRUE, var.equal = FALSE, :
not enough 'x' observations
> t_test( df, formula = post~pre, paired=T )
Error in t.test.default(x = 2, y = 3, paired = TRUE, var.equal = FALSE, :
not enough 'x' observations
I don't understand the difference between this data frame and the ToothGrowth example they give in the rstatix package that works perfectly:
> head(ToothGrowth)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
> t_test( ToothGrowth, formula = len~dose, paired=T )
# A tibble: 3 x 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 len 0.5 1 20 20 -6.97 19 1.23e- 6 0.00000246 ****
2 len 0.5 2 20 20 -11.3 19 7.19e-10 0.00000000216 ****
3 len 1 2 20 20 -4.60 19 1.93e- 4 0.000193 ***
I've also tried a format of this type, which doesn't work either:
df1 <- cbind( rep("pre",times=5), as.numeric(seq(1,5,by=1)))
df2 <- cbind( rep("post",times=5), as.numeric(seq(2,6,by=1)))
df <- data.frame( rbind( df1, df2 ) )
df
> df
X1 X2
1 pre 1
2 pre 2
3 pre 3
4 pre 4
5 pre 5
6 post 2
7 post 3
8 post 4
9 post 5
10 post 6
Can somebody please point me to a simple, straightforward description of what exactly rstatix expects as input to its functions?
According to the help file obtained by typing ?rstatix::t_test into the console, the first two arguments to the function are data (a data frame containing the variables) and formula (of the form x ~ group, where x is a numeric variable and group is a grouping variable).

From reading this, it seems clear that the numerical values you want to compare should all be in a single column, and the grouping variable should be in a different column. This is the structure that the data frame you included at the bottom of your question has, and in data science circles this would be described as a "long-format" data frame.
When specifying the formula, we need to make sure that the numeric column is on the left-hand side and the grouping variable is on the right-hand side, as described in the docs above.
If we take your own example:
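A sketch of that data frame, built directly with data.frame() rather than cbind(). This is a reconstruction, not your exact code: cbind() on a mix of strings and numbers produces a character matrix, so X2 would otherwise end up as text and need converting back to numeric before any test can use it.

```r
# Long-format data: one column of group labels, one column of numeric values.
# Column names X1 and X2 follow the printout in the question.
df <- data.frame(
  X1 = rep(c("pre", "post"), each = 5),
  X2 = as.numeric(c(1:5, 2:6))
)

# Check that X2 really is numeric and that each group has 5 rows
str(df)
print(table(df$X1))
```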
Then we can call t_test with X2 on the left-hand side of the formula (since it contains the numeric values we wish to compare), and X1 on the right (since it contains the grouping variable pre/post), using the formula X2 ~ X1.

This seems like a fairly standard way to structure data and call a formula-based statistical test in R. It matches how we might use lm, aov, glm, t.test, model.frame, not to mention functions in several important extension packages.

However, we often start with our numeric variables in different columns (as in the data frame at the top of your question, which we would describe as "wide format"). If so, we would need to reshape our data to get such functions to work as expected.
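To make "such functions" concrete, here is a small sketch showing the same long-format data frame and the same formula being passed to base R's t.test and lm, and then to rstatix::t_test. The name long_df is just for illustration, and the rstatix call is guarded because it assumes the package is installed. The test is left unpaired here.

```r
# Long-format data as in the question: X1 = group label, X2 = numeric value
long_df <- data.frame(
  X1 = rep(c("pre", "post"), each = 5),
  X2 = as.numeric(c(1:5, 2:6))
)

# Base R: Welch two-sample t-test through the formula interface
base_result <- t.test(X2 ~ X1, data = long_df)
print(base_result$p.value)

# The same formula and data also drive a linear model
fit <- lm(X2 ~ X1, data = long_df)
print(coef(fit))

# rstatix follows the same convention (only run if rstatix is available)
if (requireNamespace("rstatix", quietly = TRUE)) {
  print(rstatix::t_test(long_df, X2 ~ X1))
}
```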
One option would be to pivot into long format. For example, if we had:
Then we could pivot into long format like this:
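As a sketch of those two steps, assuming the tidyr package is installed: wide_df below is a hypothetical stand-in for a wide-format data frame like the one at the top of your question, and the column names treatment and value are the ones chosen for the long version.

```r
# Hypothetical wide-format data: one column per condition
wide_df <- data.frame(
  pre  = c(1, 2, 3, 4, 5),
  post = c(2, 3, 4, 5, 6)
)

# Pivot to long format: column names go into "treatment",
# the numbers go into "value"
long_df <- tidyr::pivot_longer(
  wide_df,
  cols      = c(pre, post),
  names_to  = "treatment",
  values_to = "value"
)
print(long_df)

# Numeric column on the left, grouping column on the right
# (only run if rstatix is available; unpaired test)
if (requireNamespace("rstatix", quietly = TRUE)) {
  print(rstatix::t_test(long_df, value ~ treatment))
}
```

Base R's stack() or reshape() could do the same job if you would rather avoid the extra dependency.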
And we would get the expected result by putting the value column on the left of the formula, and the treatment column on the right.

Important!

Note that we can't use a paired t-test in any of the examples you have given, because a paired t-test just subtracts one column from the other and tests whether the mean of the resulting vector is statistically different from 0. Since there is a fixed difference of 1 between the two variables, this results in t.test (which rstatix::t_test uses under the hood) calculating the standard error of the vector c(1, 1, 1, 1, 1), which is 0. It can't use this to do any significance testing and throws an error. This should not be a problem with real data.
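A short base R sketch of that point, using the pre/post values from the question. The post_noisy values are invented purely for illustration, to show that the paired test runs once the differences are not all identical.

```r
pre  <- c(1, 2, 3, 4, 5)
post <- c(2, 3, 4, 5, 6)

# The paired test works on the element-wise differences
diffs <- post - pre
print(diffs)      # every difference is the same
print(sd(diffs))  # so their standard deviation is zero

# With zero spread there is nothing to test against, so t.test stops
paired_attempt <- tryCatch(
  t.test(post, pre, paired = TRUE),
  error = function(e) e
)
print(inherits(paired_attempt, "error"))
if (inherits(paired_attempt, "error")) print(conditionMessage(paired_attempt))

# Hypothetical post values with some variation in the differences
post_noisy <- c(2.1, 2.8, 4.3, 4.9, 6.2)
noisy_result <- t.test(post_noisy, pre, paired = TRUE)
print(noisy_result$p.value)
```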