Most Efficient Way to Combine n Number of Pyspark Dataframes


I have a particular function I'm needing to optimize with this basic structure:

customer_dfs = []

for customer in customer_list:
    df = ...  # PySpark transformation functions
    # {10-15 lines of customer specific transformations/aggregations}
    customer_dfs.append(df)

combined_df = spark.createDataFrame([], customer_dfs[0].schema)

for df in customer_dfs:
    combined_df = combined_df.union(df)

return combined_df

Even though each of these DataFrames is relatively small, the performance of this iterative union clearly degrades with each iteration and quickly becomes untenable.

Is there a faster/more performant way to achieve the same result here? This is something we're looking to execute within the context of an AWS Glue 4.0 Job.
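One way to confirm what is actually degrading (my assumption, not something stated above) is to print the query plan as the loop runs: each union() extends the logical plan, so a plan that keeps growing points to plan-size/analysis overhead rather than data volume. A minimal diagnostic sketch, reusing combined_df from the loop above:

# Print the logical and physical plans; if this output grows with every
# iteration, the slowdown is coming from plan size, not from the data.
combined_df.explain(extended=True)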


There are 2 best solutions below

Lingesh.K

You can try using the reduce function from the functools module along with DataFrame.unionByName, given that each DataFrame is relatively small. This assumes the column names match across the DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

# Apply reduce using unionByName on the list of customer dataframes
combined_df = reduce(DataFrame.unionByName, customer_dfs)
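Tied back to the question's loop, this collapses the second loop into a single expression, and the empty seed DataFrame is no longer needed. A minimal sketch, where build_customer_df() is a hypothetical stand-in for the 10-15 lines of customer-specific transformations:

from functools import reduce
from pyspark.sql import DataFrame

# build_customer_df is a placeholder for the per-customer
# transformations/aggregations described in the question.
customer_dfs = [build_customer_df(customer) for customer in customer_list]

# Single reduce over the whole list; column names must match across DataFrames.
combined_df = reduce(DataFrame.unionByName, customer_dfs)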
NNM

You should be able to use a function like the one below and keep applying unionByName across any number of DataFrames:

from functools import reduce


def _agg_dfs(*union_df):
    # unionByName with allowMissingColumns=True tolerates DataFrames whose
    # schemas don't line up exactly; missing columns are filled with nulls.
    df = reduce(
        lambda df1, df2: df1.unionByName(df2, allowMissingColumns=True), union_df
    )
    return df


# use like below
output_df = _agg_dfs(df1, df2, df3)
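As a note on the design choice: allowMissingColumns=True (available since Spark 3.1, so covered by Glue 4.0's Spark 3.3) is what lets this work when the customer-specific aggregations don't all produce identical schemas. Applied to the question's list of DataFrames, usage would look like:

# Unpack the list of per-customer DataFrames into the helper.
output_df = _agg_dfs(*customer_dfs)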