Mahalanobis difference by group with dplyr

207 Views Asked by At

I want to get a Mahalanobis difference for each set of two scores, after being grouped by another variable. In this case, it would be a Mahalanobis difference for each Attribute (across each set of 2 scores). The output should be 3 Mahalanobis distances (one for A, B and C).

Currently I am working with (in my original dataframe, there are some NAs, hence I include one in the reprex):

library(tidyverse)
library(purrr)



df <- tibble(Attribute = unlist(map(LETTERS[1:3], rep, 5)),
             Score1 = c(runif(7), NA, runif(7)),
             Score2 = runif(15))

mah_db <- df %>% 
  dplyr::group_by(Attribute) %>% 
  dplyr::summarise(MAH = mahalanobis(Score1:Score2, 
                                     center = base::colMeans(Score1:Score2), 
                                     cov(Score1:Score2, use = "pairwise.complete.obs")))

This raises the error:

Caused by error in base::colMeans(): ! 'x' must be an array of at least two dimensions

But as far as I can tell, I am giving colMeans two columns.

So what's going wrong here? And I wonder if even fixing this gives a complete solution?

1

There are 1 best solutions below

0
On

It seems your question is more about the statistics than dplyr. So I just give a small example based on your data and an adapted example from ?mahalanobis. Perhaps also have a look here or here.

df <- subset(x = df0, Attribute == "A", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
plot(density(df$mahalanobis, bw = 0.3), ylim = c(0, 0.8),
     main="Squared Mahalanobis distances"); 
grid()
rug(df$mahalanobis)

df <- subset(x = df0, Attribute == "B", select = c("Score1", "Score2"))
df <- df[complete.cases(df), ]
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "red",
     main="Squared Mahalanobis distances"); 
rug(df$mahalanobis, col = "red")

df <- subset(x = df0, Attribute == "C", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "green",
     main="Squared Mahalanobis distances"); 
rug(df$mahalanobis, col = "green")

Hope, that helps (and too long for a comment).
(Of course you can make to code much shorter, but it shows in each step what happens.)