I am working with PySpark and wondering whether there is a way to compute pairwise distances between rows. For instance, given a dataset like this:
+----------+------------+--------+-------+-------+
|   product| Mitsubishi | Toyota | Tesla | Honda |
+----------+------------+--------+-------+-------+
|Mitsubishi|           0|     0.8|    0.2|      0|
|Toyota    |           0|       0|      0|      0|
|Tesla     |         0.1|     0.4|      0|    0.3|
|Honda     |           0|     0.5|    0.1|      0|
+----------+------------+--------+-------+-------+
I ask because in pandas I used these lines of code with scikit-learn:

from sklearn.metrics import pairwise_distances

array = df1_corr.drop(columns=['new_product_1']).values
correlation = pairwise_distances(array, array, metric='correlation')
What about PySpark — is there a built-in pairwise_distance there, or in Spark ML?
The way to go here is a pandas UDF (pandas_udf). These are good reads with examples similar to your scenario:
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://towardsdatascience.com/scalable-python-code-with-pandas-udfs-a-data-science-application-dd515a628896
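To make this concrete, here is a minimal sketch. The core of the work is an ordinary pandas/NumPy function that computes the correlation-distance matrix for a block of rows (equivalent to sklearn's metric='correlation', i.e. 1 minus the Pearson correlation between rows); that function is what you would wrap and hand to Spark. The PySpark wiring shown in the comments is an assumption about your setup (a SparkSession, a 'product' identifier column, and a grouping column) and uses groupBy(...).applyInPandas, the grouped-map form of pandas UDFs.

```python
import numpy as np

def correlation_distances(arr):
    """Pairwise correlation distance between the rows of a 2-D array.

    Matches sklearn's pairwise_distances(arr, arr, metric='correlation'):
    d(u, v) = 1 - corr(u, v), so the diagonal is 0 and the matrix is symmetric.
    """
    # Center each row at its own mean.
    centered = arr - arr.mean(axis=1, keepdims=True)
    # Row norms of the centered data.
    norms = np.linalg.norm(centered, axis=1)
    # Cosine similarity of centered rows == Pearson correlation of the rows.
    sim = (centered @ centered.T) / np.outer(norms, norms)
    return 1.0 - sim

# Hedged PySpark sketch (not runnable as-is; assumes a SparkSession `spark`,
# a DataFrame `df` with an identifier column 'product', and some grouping
# column 'group' that bounds each block of rows to a single pandas batch):
#
# import pandas as pd
#
# def compute(pdf: pd.DataFrame) -> pd.DataFrame:
#     feats = pdf.drop(columns=['product']).to_numpy(dtype=float)
#     dists = correlation_distances(feats)
#     out = pd.DataFrame(dists, columns=pdf['product'].tolist())
#     out.insert(0, 'product', pdf['product'].to_numpy())
#     return out
#
# result = df.groupBy('group').applyInPandas(compute, schema=...)
```

Note that this pattern only parallelizes across groups: each group's rows must fit in one executor's memory, so it works when you need many distance matrices for moderately sized blocks, not one matrix over the entire DataFrame.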