Input dimensions for distance function for nearest neighbors

In the context of unsupervised nearest neighbors with scikit-learn, I have implemented my own distance function to deal with my uncertain points (i.e. a point is represented as a normal distribution):

def my_mahalanobis_distance(x, y):
    '''
    x: array of shape (4,)  x[0]: mu_x_1, x[1]: mu_x_2,
                            x[2]: cov_x_11, x[3]: cov_x_22
    y: array of shape (4,)  y[0]: mu_y_1, y[1]: mu_y_2,
                            y[2]: cov_y_11, y[3]: cov_y_22
    '''
    # means are in x[:2], y[:2]; the diagonal covariance entries in x[2:], y[2:]
    cov_inv = np.linalg.inv(np.diag(x[2:]) + np.diag(y[2:]))
    return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv)
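
For reference, the function behaves as expected on two well-formed points (a quick check with made-up numbers, following the layout in the docstring):

import numpy as np
import scipy as sp
import scipy.spatial

p = np.array([0.0, 0.0, 1.0, 1.0])  # mu = (0, 0), diagonal cov = (1, 1)
q = np.array([3.0, 4.0, 1.0, 1.0])  # mu = (3, 4), diagonal cov = (1, 1)
# summed covariance is 2*I, so the distance is sqrt((3**2 + 4**2) / 2) ~ 3.54
print(my_mahalanobis_distance(p, q))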

However, when I set up my nearest neighbors estimator:

nnbrs = NearestNeighbors(n_neighbors=1, metric='pyfunc', func=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)

where X is an (N, 4) array (n_samples, n_features). If I print the shapes of x and y inside my_mahalanobis_distance, I get (10,) instead of the (4,) I would expect.

Example:

I add the following line to my_mahalanobis_distance:

print(x.shape)

Then in my main:

n_features = 4
n_samples = 10
# generate X array:
X = np.random.rand(n_samples, n_features)
nnbrs = NearestNeighbors(n_neighbors=1, metric='pyfunc', func=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)

The result is:

(10,)
ValueError: shapes (2,) and (8,8) not aligned: 2 (dim 0) != 8 (dim 0)

I understand the error itself, but I do not understand why x.shape is (10,) when X has only 4 features.

I am using Python 2.7.10 and scikit-learn 0.16.1.

EDIT:

Replacing return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv) with return 1, just to test, gives:

(10,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)

So only the first call to my_mahalanobis_distance receives the wrong shape. Looking at the x and y values in this first call, my observations are:

  • x and y are identical

  • if I run my code multiple times, x and y are still identical, but their values have changed compared to the previous run.

  • these values seem to come from a numpy.random function.

I would conclude that this first call is a piece of debugging code that has not been removed.
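
A hypothetical sketch of what such a call could look like (not taken from the scikit-learn 0.16.1 source; it only reproduces the observations above: a single extra call with identical, random x and y of length 10):

import numpy as np

def _check_user_metric(func, **kwargs):
    # hypothetical sanity check: call the metric once on a random
    # length-10 vector (passed as both x and y) and make sure it
    # returns something that converts to a float
    x = np.random.random(10)
    float(func(x, x, **kwargs))

# a metric that works for any input length passes the check silently
_check_user_metric(lambda a, b: float(np.sum((a - b) ** 2)))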

2 Answers

Best answer (score 6):

I customized my my_mahalanobis_distance to handle this issue:

import warnings

import numpy as np
import scipy as sp
import scipy.spatial


def my_mahalanobis_distance(x, y):
    '''
    x: array of shape (4,)  x[0]: mu_x_1, x[1]: mu_x_2,
                            x[2]: cov_x_11, x[3]: cov_x_22
    y: array of shape (4,)  y[0]: mu_y_1, y[1]: mu_y_2,
                            y[2]: cov_y_11, y[3]: cov_y_22
    '''
    if (x.size, y.size) == (4, 4):
        return sp.spatial.distance.mahalanobis(
            x[:2], y[:2],
            np.linalg.inv(np.diag(x[2:]) + np.diag(y[2:])))

    # to handle the buggy first call when calling NearestNeighbors.fit()
    else:
        warnings.warn('x and y are respectively of size %i and %i' % (x.size, y.size))
        return sp.spatial.distance.euclidean(x, y)
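
For completeness, a minimal usage sketch with this guarded metric (it mirrors the metric='pyfunc' call from the question on scikit-learn 0.16.1; kneighbors is the standard query method, and X is assumed to have the mu_1, mu_2, cov_11, cov_22 layout described above):

from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10, 4)  # 10 uncertain points: mu_1, mu_2, cov_11, cov_22

nnbrs = NearestNeighbors(n_neighbors=1, metric='pyfunc',
                         func=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)

# distances and indices of the single nearest neighbor of each point
distances, indices = nearest_neighbors.kneighbors(X)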

Second answer (score 0):

This is not an answer, but it is too long for a comment: I cannot reproduce the error.

Using:

Python 3.5.2 and scikit-learn 0.18.1

with the code:

from sklearn.neighbors import NearestNeighbors
import numpy as np
import scipy as sp
import scipy.spatial


def my_mahalanobis_distance(x, y):
    cov_inv = np.linalg.inv(np.diag(x[2:]) + np.diag(y[2:]))
    print(x.shape)
    return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv)

n_features = 4
n_samples = 10
# generate X array:
X = np.random.rand(n_samples, n_features)
nnbrs = NearestNeighbors(n_neighbors=1, metric=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)

The output is:

(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
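
If the extra shape-(10,) call in the question's setup does come from a validation step in the tree-based 'pyfunc' metric, one possible workaround (an assumption on my side, not verified against scikit-learn 0.16.1) is to force brute-force search, where the callable metric is only ever applied to actual rows of X:

nnbrs = NearestNeighbors(n_neighbors=1, algorithm='brute',
                         metric=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)
# distances and indices of the single nearest neighbor of each point
distances, indices = nearest_neighbors.kneighbors(X)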