Udf vs pandas_udf on an extremely large dataset


I am trying to understand the performance difference between pandas_udf and udf on a very large dataset. According to the documentation and release videos from Databricks, pandas_udf (which is vectorized) should perform better than udf (which processes one row at a time). However, I am seeing the reverse.
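For context, the row-at-a-time versus vectorized distinction can be sketched in plain pandas, without Spark. This is only an illustration of the two calling conventions: the function names and the three-row sample frame are hypothetical, and it says nothing about the serialization overhead Spark adds around either kind of UDF.

```python
import pandas as pd

def distance_row_at_a_time(x1, y1, x2, y2):
    # Called once per row with plain Python floats, like a regular udf.
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

def distance_vectorized(x1: pd.Series, y1: pd.Series,
                        x2: pd.Series, y2: pd.Series) -> pd.Series:
    # Called once per batch with whole columns, like a pandas_udf.
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

# Hypothetical sample data with the same column names as the question.
pdf = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0],
    "x2": [3.0, 1.0, 5.0],
    "y1": [0.0, 1.0, 2.0],
    "y2": [4.0, 1.0, 6.0],
})

# Row-at-a-time: one Python function call per row.
row_result = [
    distance_row_at_a_time(row.x1, row.y1, row.x2, row.y2)
    for row in pdf.itertuples(index=False)
]

# Vectorized: a single call over entire columns.
vec_result = distance_vectorized(pdf["x1"], pdf["y1"], pdf["x2"], pdf["y2"])

print(row_result)
print(vec_result.tolist())
```

Both paths compute the same values; the difference is only in how many times Python is entered per batch of rows.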

Say a DataFrame df has four columns, x1, x2, y1 and y2. I tested the two functions below to create the distance column:

import random as r  # r.random() is used below to generate the test data
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import DoubleType
from datetime import datetime

NNN = 10**7

df = spark.createDataFrame(
  [(r.random() * 10.0**6, r.random() * 10.0**6, r.random() * 10.0**6, r.random() * 10.0**6)
  for i in range(1,NNN)],
  schema='x1: double, x2: double, y1: double, y2: double'
)

@udf(DoubleType())
def euclidean_distance_udf(x1, y1, x2, y2):
  return ((x2-x1)**2 + (y2-y1)**2)**0.5


t = datetime.now()
# Apply the PySpark UDF to create a new column 'distance'
result_df = df.withColumn("distance", euclidean_distance_udf(col("x1"),col("y1"), col("x2"), col("y2")))
_ = result_df.collect()
result_df.show()
print('udf', datetime.now() - t)

Pandas version:

# -------------------------------------------------------------------------------------------
# The pandas_udf version takes about twice as long as the plain PySpark udf

df2 = spark.createDataFrame(
  [(r.random() * 10.0**6, r.random() * 10.0**6, r.random() * 10.0**6, r.random() * 10.0**6)
  for i in range(1,NNN)],
  schema='x1: double, x2: double, y1: double, y2: double'
)

@pandas_udf(DoubleType())
def euclidean_distance_pandas_udf(x1: pd.Series, y1: pd.Series, x2: pd.Series, y2: pd.Series) -> pd.Series:
  return ((x2-x1)**2 + (y2-y1)**2)**0.5
    
    
t = datetime.now()
# Apply the pandas UDF to df2 (created above) to create a new column 'distance'
result_df_pandas = df2.withColumn("distance", euclidean_distance_pandas_udf(col("x1"), col("y1"), col("x2"), col("y2")))
_ = result_df_pandas.collect()
result_df_pandas.show()
print('pandas_udf', datetime.now() - t)
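Note that each timed block above calls collect() and then show(), so the measurement includes shipping every row to the driver plus a second evaluation for show(), not only the UDF itself. A minimal sketch of a timing helper that could wrap just the step of interest is given here; `timed` and `timings` are hypothetical names, and the Spark call shown in the comment assumes Spark 3.0 or later, where the `noop` write format is available.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical container: label -> elapsed seconds

@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in workload so the sketch runs without a Spark session.
with timed("dummy"):
    total = sum(i * i for i in range(100_000))

# With Spark (assumption: Spark 3.0+), the block could instead force full
# evaluation without collecting rows to the driver, for example:
#   with timed("pandas_udf"):
#       result_df_pandas.write.format("noop").mode("overwrite").save()
print(timings)
```

Timing the two UDFs this way would compare the cost of evaluating the distance column rather than the cost of moving 10**7 rows into driver memory.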

Any help is appreciated.
