I am trying to generate z-scores (and then p-values) by group, comparing each group in turn against a single 'reference' group, so that I can use hypothesis testing to see whether each group's distribution is distinct from the reference.
In the example below, I would like to perform z-tests on a, b and c, each against d:
z-test comparing a vs d
z-test comparing b vs d
z-test comparing c vs d
> df
group measurement
a 1
a 2
b 6
b 7
b 9
c 4
c 5
c 4
d 8
d 8
My final df should look something like this:
> group_df
group pvalue
a 0.005
b 0.3
c 0.001
d 1.000
So far I have something like this:
library(dplyr)
# d group stats
d_only <- df %>% filter(grepl("d", group)) %>% select("measurement")
d_mean <- mean(d_only$measurement)
d_n <- nrow(d_only)
# generate values needed to calculate zscore (per-group mean, sd and count)
group_df <- df %>% group_by(group) %>% summarise(mean = mean(measurement), sd = sd(measurement), n = n())
group_df$sqrt_n <- (group_df$n + d_n) %>% sqrt()
# midpoint between each group's mean and the d mean
group_df$pop_mean <- (group_df$mean + d_mean) / 2
# calculate zscore
group_df$zscore <-
  (group_df$mean - group_df$pop_mean) / (group_df$sd / group_df$sqrt_n)
group_df$pvalue <- pnorm(-abs(group_df$zscore))
But I am getting some p-values that seem wrong, and it feels like there should be a more elegant way of doing this.
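To make the goal concrete, here is a minimal sketch of the shape I am after. It assumes a two-sample z statistic (difference in means divided by sqrt(sd_x^2/n_x + sd_ref^2/n_ref), using the sample standard deviations) and a two-sided p-value. Neither choice is confirmed as the right test for this data. The names `ref_group`, `stats`, `ref` and the `*_x` columns are made up for illustration.

```r
library(dplyr)

# Example data from above
df <- data.frame(
  group = c("a", "a", "b", "b", "b", "c", "c", "c", "d", "d"),
  measurement = c(1, 2, 6, 7, 9, 4, 5, 4, 8, 8)
)

ref_group <- "d"  # hypothetical variable holding the reference group label

# Per-group mean, sample sd and count
stats <- df %>%
  group_by(group) %>%
  summarise(mean_x = mean(measurement), sd_x = sd(measurement), n_x = n())

# One-row summary of the reference group
ref <- stats %>% filter(group == ref_group)

group_df <- stats %>%
  mutate(
    # assumed standard error of the difference in means (two-sample form)
    se = sqrt(sd_x^2 / n_x + ref$sd_x^2 / ref$n_x),
    zscore = (mean_x - ref$mean_x) / se,
    # two-sided p-value; the reference compared with itself is set to 1
    pvalue = if_else(group == ref_group, 1, 2 * pnorm(-abs(zscore)))
  ) %>%
  select(group, zscore, pvalue)

print(group_df)
```

The reference row is special-cased because comparing d with itself gives 0/0 when its standard deviation is zero, as it is here.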