In R, I am trying to calculate Mahalanobis distances to check for outliers in my data set, to test one of the assumptions for a MANOVA. My data set has missing values. I originally tried the mahalanobis function, but that didn't seem to work with missing values, so I tried the MDmiss function in the modi package. This worked for the cases where values were missing in both of two variables (DO and Chla). However, if data were missing in only Chla or only DO, the distances were not calculated. Neither MDmiss nor mahalanobis returned distances for the rows without missing values.
I also tried using is.na and na.omit with the original mahalanobis function, but that didn't work either. I have included a sample data set. I appreciate the help. Thanks.
envdata <- data.frame(WaterTemp = c(56.7, 56.4, 60.8,60.6, 59.3, 57.5, 57.9, 65.8,59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA))
#Check for outliers using the Mahalanobis distance
#https://www.statology.org/mahalanobis-distance-r/
#Mahalanobis only works on numeric data. Make new data frame with only numeric variables
#Convert integers to numeric
library(dplyr)
envdata <- envdata %>% mutate(SPC = as.numeric(SPC), DO = as.numeric(DO))
envdata_numeric <- envdata %>% dplyr::select(WaterTemp, SPC, Salinity, DO, Chla)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- mahalanobis(envdata_numeric, colMeans(envdata_numeric, na.rm = TRUE), cov(envdata_numeric))
#create new column in data frame to hold p-value for each Mahalanobis distance
envdata_numeric$p <- pchisq(envdata_numeric$mahal, df = 4, lower.tail = FALSE)
#Df = (c-1)
#DF = 5-1
envdata_numeric
#***#error with calculating distances. Possibly because of NA values. Try this other package. https://search.r-project.org/CRAN/refmans/modi/html/MDmiss.html
devtools::install_github("martinSter/modi")
library(modi)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- MDmiss(envdata_numeric, colMeans(envdata_numeric), cov(envdata_numeric))
There is a problem with the data you have shown: the columns DO and Chla are collinear. Namely, you have only two complete observations (Rows 3 and 8 of envdata_numeric).

Roughly speaking, you are trying to find outliers by calculating distances, but you do not have enough information to "draw the ellipsoid" around the cloud of your points. Geometrically, this is what mahalanobis is doing. I sketched the situation: the white circles are the columns without NA, and the big red points are the two observations located in higher dimensions (Rows 3 and 8). Infinitely many ellipsoids can be drawn through 2 points and the center (I drew 2).

Anyway, if I add a data point to the DO column, e.g. 100 in Row 1, and then proceed with imputation (I used the mice package), I can formally calculate the distances. The resulting p-values are > 0.1. This means that although the algorithm runs, even 3 observations are not enough to judge outliers. There are too many NAs.
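The point about complete observations can be checked directly in base R. The following is a minimal sketch, assuming the sample data from the question. It only counts the complete rows and inspects the covariance matrix they produce; it is not the imputation step with mice, and the object names complete_rows and S are made up for illustration.

```r
# Sample data from the question
envdata <- data.frame(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC       = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity  = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05),
  DO        = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103),
  Chla      = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA)
)

# Rows with no missing value in any column
complete_rows <- which(complete.cases(envdata))
complete_rows  # rows 3 and 8

# Covariance matrix estimated from the complete rows only
# (hypothetical name S, used here for illustration)
S <- cov(envdata, use = "complete.obs")

# Two observations can span only one direction, so S is rank deficient.
# mahalanobis() has to invert the covariance matrix, which is not
# possible when the rank is below the number of variables.
qr(S)$rank < ncol(S)
```

With only two complete rows, any pair of columns is perfectly correlated on those rows, which is why DO and Chla appear collinear. Dropping the rows with NA therefore cannot give a usable covariance matrix for all five variables.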