Sparse implementations of distance computations in Python / scikit-learn


I have a large (100K by 30K) and (very) sparse dataset in svmlight format which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a SciPy sparse matrix (CSR format).

I simply need to compute the pairwise distances of all training points as

D = pdist(X)

Unfortunately, distance computation implementations in scipy.spatial.distance work only for dense matrices. Due to the size of the dataset it is infeasible to, say, use pdist as

D = pdist(X.todense())
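For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes float64 storage (8 bytes per value), which is an assumption and not something stated about the file itself:

```python
# Rough memory estimate for the dense route, assuming float64 values.
n_samples, n_features = 100_000, 30_000
bytes_per_float = 8  # float64 (assumption)

# Size of X once densified.
dense_X_gb = n_samples * n_features * bytes_per_float / 1e9

# pdist returns a condensed vector with one entry per unordered pair.
n_pairs = n_samples * (n_samples - 1) // 2
condensed_D_gb = n_pairs * bytes_per_float / 1e9

print(f"dense X: {dense_X_gb:.0f} GB, pdist output: {condensed_D_gb:.0f} GB")
```

So the pdist result alone would be large even if X itself could stay sparse.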

Any pointers to sparse matrix distance computation implementations, or workarounds for this problem, will be greatly appreciated.

Many thanks


There is 1 best solution below


In scikit-learn there is a sklearn.metrics.euclidean_distances function (also importable from sklearn.metrics.pairwise) that works for both sparse matrices and dense NumPy arrays. See the reference documentation.

However, non-Euclidean distances are not yet implemented for sparse matrices.
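As a minimal sketch of how this could be used: euclidean_distances accepts the sparse input directly, but it returns a dense array, so for 100K points the full 100K by 100K result (about 80 GB as float64) would still not fit in memory. One possible workaround is to compute distances in row blocks and reduce each block immediately. The helper name iter_distance_blocks, the block size, and the random stand-in data below are all assumptions for illustration, not part of scikit-learn or of the original dataset:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances

# Small random sparse matrix standing in for the real data (assumption).
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

def iter_distance_blocks(X, block_size=64):
    """Hypothetical helper: yield (start, D_block), where D_block is a dense
    array of distances from rows X[start:start + block_size] to all rows of X."""
    for start in range(0, X.shape[0], block_size):
        stop = min(start + block_size, X.shape[0])
        # Sparse input is accepted as-is; only the output block is dense.
        yield start, euclidean_distances(X[start:stop], X)

# Example reduction: index of the nearest other point for every row.
nearest = np.empty(X.shape[0], dtype=int)
for start, D_block in iter_distance_blocks(X):
    rows = np.arange(D_block.shape[0])
    D_block[rows, start + rows] = np.inf  # mask each point's distance to itself
    nearest[start:start + D_block.shape[0]] = D_block.argmin(axis=1)
```

Only one block of shape (block_size, n_samples) is held in memory at a time, so the block size can be tuned to the available RAM.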