I am working with PySpark and wondering whether there is a way to compute pairwise distances between rows. For instance, given a dataset like this:
+----------+------------+--------+-------+-------+
|   product| Mitsubishi | Toyota | Tesla | Honda |
+----------+------------+--------+-------+-------+
|Mitsubishi|           0|     0.8|    0.2|      0|
|Toyota    |           0|       0|      0|      0|
|Tesla     |         0.1|     0.4|      0|    0.3|
|Honda     |           0|     0.5|    0.1|      0|
+----------+------------+--------+-------+-------+
I ask because in pandas I used these lines of code with scikit-learn:

from sklearn.metrics import pairwise_distances

array = df1_corr.drop(columns=['new_product_1']).values
correlation = pairwise_distances(array, array, metric='correlation')
What about PySpark — is there a built-in pairwise_distance there, or in Spark ML?
The way to go here is a pandas UDF (pandas_udf). These are good reads with examples similar to your scenario:
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://towardsdatascience.com/scalable-python-code-with-pandas-udfs-a-data-science-application-dd515a628896
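To make this concrete, here is a minimal sketch. The core of the work is an ordinary pandas/NumPy function that computes the correlation-distance matrix for a block of rows (equivalent to sklearn's metric='correlation', i.e. 1 minus the Pearson correlation between rows); that function is what you would wrap and hand to Spark. The PySpark wiring shown in the comments is an assumption about your setup (a SparkSession, a 'product' identifier column, and a grouping column) and uses groupBy(...).applyInPandas, the grouped-map form of pandas UDFs.

```python
import numpy as np

def correlation_distances(arr):
    """Pairwise correlation distance between the rows of a 2-D array.

    Matches sklearn's pairwise_distances(arr, arr, metric='correlation'):
    d(u, v) = 1 - corr(u, v), so the diagonal is 0 and the matrix is symmetric.
    """
    # Center each row at its own mean.
    centered = arr - arr.mean(axis=1, keepdims=True)
    # Row norms of the centered data.
    norms = np.linalg.norm(centered, axis=1)
    # Cosine similarity of centered rows == Pearson correlation of the rows.
    sim = (centered @ centered.T) / np.outer(norms, norms)
    return 1.0 - sim

# Hedged PySpark sketch (not runnable as-is; assumes a SparkSession `spark`,
# a DataFrame `df` with an identifier column 'product', and some grouping
# column 'group' that bounds each block of rows to a single pandas batch):
#
# import pandas as pd
#
# def compute(pdf: pd.DataFrame) -> pd.DataFrame:
#     feats = pdf.drop(columns=['product']).to_numpy(dtype=float)
#     dists = correlation_distances(feats)
#     out = pd.DataFrame(dists, columns=pdf['product'].tolist())
#     out.insert(0, 'product', pdf['product'].to_numpy())
#     return out
#
# result = df.groupBy('group').applyInPandas(compute, schema=...)
```

Note that this pattern only parallelizes across groups: each group's rows must fit in one executor's memory, so it works when you need many distance matrices for moderately sized blocks, not one matrix over the entire DataFrame.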