Following up on this question: Compute the pairwise distance in scipy with missing values
Test case: I want to compute the pairwise Euclidean distance between series of different lengths that are grouped together, and I have to do it as efficiently as possible.
One way to make it work is this:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist
a = pd.DataFrame(np.random.rand(10, 4), columns=['a','b','c','d'])
a.loc[0, 'a'] = np.nan
a.loc[1, 'a'] = np.nan
a.loc[0, 'c'] = np.nan
a.loc[1, 'c'] = np.nan
def dropna_on_the_fly(x, y):
    # Euclidean distance that skips coordinates where either value is NaN
    return np.sqrt(np.nansum(((x-y)**2)))
pdist(a, dropna_on_the_fly)
However, I suspect this is very inefficient: the built-in metrics of pdist are optimized internally, whereas a custom function is simply called once for every pair of rows.
I have a hunch that there is a vectorized solution in numpy, in which I broadcast the subtraction and then apply np.nansum for a NaN-resistant sum, but I am unsure how to proceed.
Inspired by this post, there would be two solutions.
Approach #1 : The vectorized solution would be -
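The snippet for this approach is not shown here, so what follows is only a minimal sketch of what a vectorized version could look like, not necessarily the original code. It assumes the data is available as a 2-D float array `ar` (for the DataFrame `a` from the question, that would be `a.to_numpy()`), and the name `dists_vectorized` is made up for illustration. It uses `np.triu_indices` to enumerate every row pair in the same order as pdist's condensed output.

```python
import numpy as np

# Same shape and NaN pattern as the DataFrame in the question
# (for the DataFrame itself, ar = a.to_numpy() would be the equivalent).
rng = np.random.default_rng(0)
ar = rng.random((10, 4))
ar[:2, [0, 2]] = np.nan  # rows 0 and 1, columns 'a' and 'c'

# Indices (r, c) of every row pair with r < c, in row-major order,
# which matches the ordering of pdist's condensed distance vector.
r, c = np.triu_indices(ar.shape[0], k=1)

# Subtract all pairs in one shot; nansum ignores coordinates
# where either of the two values is NaN.
dists_vectorized = np.sqrt(np.nansum((ar[r] - ar[c]) ** 2, axis=1))
```

Note that this builds temporary arrays with one row per pair, i.e. n*(n-1)/2 rows, so memory use grows quadratically with the number of series.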
Approach #2 : The memory-efficient and more performant one for large arrays would be -
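The snippet for this approach is not shown here either. As a rough sketch under my own assumptions (not necessarily the original code), one way to keep memory low is to loop over the rows, compare each row only against the rows after it, and replace `np.nansum` with an explicit validity mask plus `np.einsum` for the row-wise sum of squares. The names `valid`, `filled` and `dists_loop` are made up for illustration.

```python
import numpy as np

# Same shape and NaN pattern as the DataFrame in the question
# (for the DataFrame itself, ar = a.to_numpy() would be the equivalent).
rng = np.random.default_rng(0)
ar = rng.random((10, 4))
ar[:2, [0, 2]] = np.nan  # rows 0 and 1, columns 'a' and 'c'

valid = ~np.isnan(ar)              # True where a value is present
filled = np.where(valid, ar, 0.0)  # NaNs replaced by 0 so arithmetic stays finite

n = ar.shape[0]
dists_loop = np.empty(n * (n - 1) // 2, dtype=filled.dtype)

start = 0
for i in range(n - 1):
    # Differences between row i and every later row: shape (n - i - 1, d)
    diff = filled[i] - filled[i + 1:]
    # Zero out coordinates that are missing in either row of the pair
    diff *= valid[i] & valid[i + 1:]
    stop = start + diff.shape[0]
    # Row-wise sum of squares without building another temporary
    dists_loop[start:stop] = np.einsum('ij,ij->i', diff, diff)
    start = stop

dists_loop = np.sqrt(dists_loop)
```

The output is in the same condensed order as pdist, but each iteration only allocates a temporary of at most n-1 rows instead of one row per pair.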