Sample t-test inside a loop


I'm trying to simulate an experiment within a data frame, running many replications with t-tests and extracting the significance values.

I created the "treatment" variable within my dataset to randomly assign observations to the treatment group (1) or the control group (0). I then ran a t-test of "Ideology," a continuous variable in my dataset, against the treatment variable, and tried to extract the difference between the groups and the p-values.

set.seed(17800)

differences <- c()
p_values <- c()
 
for (i in 1:5000) {
  test <- experiment_pool %>%
   mutate(treatment = sample(rep(c(0,1), length = n())))
  t_test_result <- t.test(x = test$Ideology, y = test$treatment)
  differences[i] <- diff(t_test_result$estimate)
  p_values[i] <- t_test_result$p.value
}


results <- data.frame(Difference = differences, PValue = p_values) %>%
  mutate(sig_diff = if_else(PValue < 0.5, 1, 0))

The resulting data frame has constant values in the "Difference" and "PValue" columns. I checked that the "treatment" sampling is well balanced across draws, so I think the problem is in the t-test step.
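For example, this is how I checked the balance of a single draw (a minimal sketch using my experiment_pool data frame):

library(dplyr)

# One random draw of the treatment assignment; table() shows the group sizes
test <- experiment_pool %>%
  mutate(treatment = sample(rep(c(0, 1), length = n())))
table(test$treatment)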


1 Answer


Since you didn't share what your experiment_pool data frame looks like, I rewrote the code so that it achieves what I think you wanted (please correct me if this is not what you were looking for).

A t-test is based on the general linear model, in which we usually treat the predictor variable (here, treatment) as fixed and only the outcome variable (here, Ideology) as random. The only thing that really needs to vary per replication is therefore the outcome variable. In my code I generated a treatment variable with equal group sizes, but you could of course test scenarios with unequal groups as well.
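For instance, an unbalanced design only requires a different treatment vector, something like this hypothetical 30/20 split for a total sample of 50:

# Unequal group sizes: 30 observations in group 1, 20 in group 2
treatment_unbalanced <- rep(c(1, 2), times = c(30, 20))
table(treatment_unbalanced)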

One then needs to pick a population value for how strongly the treatment variable is related to Ideology, i.e. how large the population difference in ideology between the treatment and control group is. I used the package faux because it can create bivariately associated variables via a correlation coefficient, which I find rather handy.
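As a quick sanity check (not part of the simulation below), rnorm_pre() can also be asked to reproduce the requested correlation exactly; if your version of faux has the empirical argument, the sample correlation will match r:

library(faux)

set.seed(1)
x <- rep(c(1, 2), times = 25)
# empirical = TRUE makes the sample correlation equal r exactly
y <- rnorm_pre(x, mu = 0, sd = 1, r = 0.3, empirical = TRUE)
cor(x, y)  # 0.3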

library(faux)
library(tidyverse)

set.seed(17800)

# Number of replications
R <- 5000

# Size of the samples that will be drawn from the population
n <- 50

# Preallocate the result vectors
differences <- numeric(R)
p_values <- numeric(R)

# Predictor variable: treatment vs control (equal group sizes)
treatment <- rep(c(1, 2), times = n/2)

for (i in 1:R) {
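  # Draw the outcome: Ideology correlated r = .30 with treatment in the population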
  Ideology <- rnorm_pre(treatment, mu = 0, sd = 1, r = 0.3)
  t_test_result <- t.test(Ideology ~ treatment, paired = FALSE, 
                          var.equal = TRUE)
  # Changing var.equal to FALSE returns a Welch t-test
  
  differences[i] <- diff(t_test_result$estimate)
  p_values[i] <- t_test_result$p.value
}

results <- data.frame(Difference = differences, PValue = p_values) %>%
  mutate(sig_diff = if_else(PValue < 0.05, 1, 0))
# I assume you wanted to set your type I error to 0.05, not 0.50

prop_sig <- sum(results$sig_diff) / R
# The proportion of significant results across all replications is the power
# your t-test would have in this specific data situation (i.e. a sample size
# of 50 and a population value of rho = .30 give a power of about 0.57)
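As a rough cross-check of that number (my own addition, using base R's power.t.test()): with equal group sizes, a point-biserial correlation of r = .30 corresponds to a standardized mean difference of roughly d = 2r / sqrt(1 - r^2) ≈ 0.63, and the analytic power for that effect should land close to the simulated value:

# Convert the population correlation to Cohen's d (equal group sizes assumed)
r <- 0.3
d <- 2 * r / sqrt(1 - r^2)  # ~0.63

# 25 observations per group, i.e. a total sample of 50
power.t.test(n = 25, delta = d, sd = 1, sig.level = 0.05)
# The reported power should be close to the simulated ~0.57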

I think the main problem in your code was the way you specified the t-test. Writing it as t.test(x = x_variable, y = y_variable) tells R that you have two continuous variables, so it compares the mean of x_variable with the mean of y_variable, which gives meaningless results here. What you actually want to compare is the mean of the outcome within group 1 of the grouping variable (e.g. the mean Ideology of the treatment group) with its mean within group 2 (e.g. the mean Ideology of the control group). For that you have to use the formula interface: t.test(y_variable ~ x_variable).
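Applied to your original loop, the fix would look roughly like this (a sketch only, since I don't know the exact structure of experiment_pool; I'm assuming it contains an Ideology column):

library(dplyr)

set.seed(17800)

differences <- numeric(5000)
p_values <- numeric(5000)

for (i in 1:5000) {
  test <- experiment_pool %>%
    mutate(treatment = sample(rep(c(0, 1), length = n())))
  # Formula interface: compare mean Ideology between the two treatment groups
  t_test_result <- t.test(Ideology ~ treatment, data = test)
  differences[i] <- diff(t_test_result$estimate)
  p_values[i] <- t_test_result$p.value
}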

Lastly, if you really wanted "Ideology" to be your predictor (x) and the dichotomous "treatment" to be your outcome (y), a t-test is not the right model; you should instead use something like a logistic regression.
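For completeness, that model would look roughly like this (again assuming a data frame with an Ideology column and a 0/1 treatment column):

# Logistic regression: predict treatment group membership from Ideology
logit_model <- glm(treatment ~ Ideology, data = test, family = binomial)
summary(logit_model)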