Alternate approach for pdist() from scipy in Julia?


My objective is to replicate the functionality of pdist() from SciPy in Julia. I tried using the Distances.jl package to compute pairwise distances between observations. However, the results are not the same, as the example below shows.

Python Example:

from scipy.spatial.distance import pdist
a = [[1,2], [3,4], [5,6], [7,8]]
b = pdist(a)
print(b)

output --> [2.82842712 5.65685425 8.48528137 2.82842712 5.65685425 2.82842712]

Julia Example:

using Distances
a = [1 2; 3 4; 5 6; 7 8]
dist_function(x)  = pairwise(Euclidean(), x, dims = 1)
dist_function(a)

output --> 
4×4 Array{Float64,2}:
 0.0      2.82843  5.65685  8.48528
 2.82843  0.0      2.82843  5.65685
 5.65685  2.82843  0.0      2.82843
 8.48528  5.65685  2.82843  0.0

With reference to the above examples:

  1. Does pdist() from SciPy use the Euclidean metric by default?
  2. How can I replicate the SciPy result in Julia?

Please suggest a solution to this problem.

Documentation reference for pdist(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Thanks in advance!!


There are 2 best solutions below

1
On

According to the documentation page you linked, you can get the same square form as the Julia output from Python (yes, I know, this is the reverse of your question) by passing the result to squareform. In your example, add:

from scipy.spatial.distance import squareform
squareform(b)

Also, yes, the same documentation page shows that the 'metric' parameter defaults to 'euclidean' if not explicitly specified.

For the reverse situation, note that the Python vector contains exactly the elements above the diagonal of the Julia matrix, read row by row. For a 'proper' distance metric the matrix is symmetric with zeros on the diagonal, so the upper triangle carries all the information.

So you can collect the elements above the diagonal into a vector.
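As a minimal sketch of that idea, the snippet below uses plain Python (no SciPy) to build the square matrix and then read the entries above the diagonal row by row, which is the order pdist() returns them in. The helper names `full_matrix` and `condensed` are made up for illustration, and `math.dist` assumes Python 3.8 or later.

```python
import math

def full_matrix(points):
    """Square Euclidean distance matrix, one row and column per observation."""
    return [[math.dist(p, q) for q in points] for p in points]

def condensed(square):
    """Entries strictly above the diagonal, read row by row."""
    n = len(square)
    return [square[i][j] for i in range(n) for j in range(i + 1, n)]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
d = condensed(full_matrix(a))
print([round(x, 5) for x in d])
```

The six printed values should match the pdist() vector from the question. The same double loop should carry over to Julia on the result of pairwise(), with the index ranges adjusted for 1-based indexing.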

1
On

For (1), the answer is yes as per the documentation you linked, which says at the top

scipy.spatial.distance.pdist(X, metric='euclidean', *args, **kwargs)

indicating that the metric arg is indeed set to 'euclidean' by default.

I'm not sure I understand your second question: the results are the same, aren't they? The only difference seems to be that SciPy returns the upper triangle of the distance matrix (without the diagonal) as a vector, so if that is all you need, have a look at: https://discourse.julialang.org/t/vector-of-upper-triangle/7764
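If only individual entries are needed, their position in that vector can be computed directly from the matrix indices. Here is a rough sketch in plain Python, assuming 0-based indices with i < j and n observations. `condensed_index` is a hypothetical helper written for this answer, not a SciPy function.

```python
def condensed_index(n, i, j):
    """Position of the (i, j) distance, with i < j, in the row-by-row upper-triangle vector."""
    if not 0 <= i < j < n:
        raise ValueError("need 0 <= i < j < n")
    # n * i counts the full rows before row i; the subtracted term
    # removes the diagonal and lower-triangle entries that were skipped.
    return n * i + j - ((i + 2) * (i + 1)) // 2

n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
indices = [condensed_index(n, i, j) for i, j in pairs]
print(indices)
```

For the four observations in the question this enumerates the six positions in order, so entry (1, 3) of the Julia matrix (in 0-based terms) sits at position 4 of the SciPy vector. The formula would need adjusting for Julia's 1-based indexing.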