Mahalanobis distance with 0 variance

58 Views Asked by At

I'm developing a clustering algorithm, I'm using the Mahalanobis distance to decide if a point must be included in a cluster or not. A cluster is represented by the statistics of the points that are included in it, the statistics are:

  • the sum of the components of all the points in each dimension named SUMi;
  • the sum of the squares of the components of all the points in each dimension, named SUMQi.

Choosing these statistics, the centroid’s coordinate in the ith dimension is the SUMi/N, where N is the number of points in the cluster. Talking about the variance, I should compute the variance in the ith dimension as SUMSQi/N − (SUMi/N)^2. To initialize clusters I choose k random point in the dataset, where k is a priori known number of clusters, in this way I have k clusters, each represented by the statistics of 1 point, this means that SUMSQi/N = (SUMi/N)^2 and thus variance is 0, so I can't compute Mahalanobis distance because it need to divide by the variance of the cluster and in this case, it is 0. Which is the correct approach to this problem? And since I'm working with d-dimensional points how should I choose the threshold to discriminate if a point should be added to a cluster? Is using the module or the mean of the element of the variance vector a good idea?

The clusters are represtend by (2*d) + 1 elements: the first is the number of points in the cluster, from the second to the d-th element there are the SUMi, from the d-th to the (2*d)-th there are the SUMQi. This is the function I implemented in C to compute variance of a cluster and Mahalanobis distance of a point from the cluster:

double mahalanobis(double *cluster, double *point, int dimension) {
   
    double* std_dev = malloc(dimension * sizeof(double ));

    /**
     * variance of the cluster
     */


    for (int i = 0; i < dimension; ++i) {
        std_dev[i] = cluster[i + dimension + 1]/cluster[0] - pow(cluster[i + 1]/cluster[0], 2);
    }

    
    /**
     * mahalanobis distance
     */

    double sum = 0;
    for (int i = 0; i < dimension; ++i) {
        sum += pow(point[i] - ((cluster[i+1] + point[i])/(cluster[0] + 1)), 2)/std_dev[i];
    }

    free(std_dev);

    return sqrt(sum);
}
0

There are 0 best solutions below