Most Efficient Way to Combine n Number of Pyspark Dataframes


I have a particular function I'm needing to optimize with this basic structure:

customer_dfs = []

for customer in customer_list:
    df = ...  # PySpark transformation functions
    # {10-15 lines of customer specific transformations/aggregations}
    customer_dfs.append(df)

combined_df = spark.createDataFrame([], customer_dfs[0].schema)

for df in customer_dfs:
    combined_df = combined_df.union(df)

return combined_df

Even though each of these DataFrames is relatively small, the performance of this iterative union clearly degrades with each iteration and quickly becomes untenable.

Is there a faster/more performant way to achieve the same result here? This is something we're looking to execute within the context of an AWS Glue 4.0 Job.
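One way to confirm what is actually degrading (my assumption, not something stated above) is to print the query plan as the loop runs: each union() extends the logical plan, so a plan that keeps growing points to plan-size/analysis overhead rather than data volume. A minimal diagnostic sketch, reusing combined_df from the loop above:

# Print the logical and physical plans; if this output grows with every
# iteration, the slowdown is coming from plan size, not from the data.
combined_df.explain(extended=True)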


There are 2 best solutions below

Lingesh.K

You can try using the reduce function from the functools module along with DataFrame.unionByName, given that each DataFrame is relatively small. This assumes the column names match across the DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

# Apply reduce using unionByName on the list of customer dataframes
combined_df = reduce(DataFrame.unionByName, customer_dfs)
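Tied back to the question's loop, this collapses the second loop into a single expression, and the empty seed DataFrame is no longer needed. A minimal sketch, where build_customer_df() is a hypothetical stand-in for the 10-15 lines of customer-specific transformations:

from functools import reduce
from pyspark.sql import DataFrame

# build_customer_df is a placeholder for the per-customer
# transformations/aggregations described in the question.
customer_dfs = [build_customer_df(customer) for customer in customer_list]

# Single reduce over the whole list; column names must match across DataFrames.
combined_df = reduce(DataFrame.unionByName, customer_dfs)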
NNM

You should be able to use a function like the one below and keep applying unionByName across any number of DataFrames:

from functools import reduce


def _agg_dfs(*union_df):
    # unionByName with allowMissingColumns=True tolerates DataFrames whose
    # schemas don't line up exactly; missing columns are filled with nulls.
    df = reduce(
        lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True), union_df
    )
    return df


# use like below
output_df = _agg_dfs(df1, df2, df3)
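As a note on the design choice: allowMissingColumns=True (available since Spark 3.1, so covered by Glue 4.0's Spark 3.3) is what lets this work when the customer-specific aggregations don't all produce identical schemas. Applied to the question's list of DataFrames, usage would look like:

# Unpack the list of per-customer DataFrames into the helper.
output_df = _agg_dfs(*customer_dfs)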