I have a particular function I need to optimize, with this basic structure:
customer_dfs = []
for customer in customer_list:
    df = ...  # PySpark transformation functions:
              # {10-15 lines of customer-specific transformations/aggregations}
    customer_dfs.append(df)

combined_df = spark.createDataFrame([], customer_dfs[0].schema)
for df in customer_dfs:
    combined_df = combined_df.union(df)
return combined_df
Even though each of these dataframes is relatively small, the performance of this iterative union clearly degrades with each iteration and quickly becomes untenable.
Is there a faster/more performant way to achieve the same result here? This is something we're looking to execute within the context of an AWS Glue 4.0 Job.
You can try using the reduce function from the functools module along with DataFrame.unionByName, given that each dataframe is relatively small. This assumes the column names are equivalent across the dataframes:
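A minimal sketch, assuming customer_dfs is the list built in your loop and all of the per-customer DataFrames share the same column names:

from functools import reduce
from pyspark.sql import DataFrame

# customer_dfs is the list of per-customer DataFrames built in the loop above.
# reduce folds the list pairwise with unionByName, so there is no need to seed
# the chain with an empty DataFrame.
combined_df = reduce(DataFrame.unionByName, customer_dfs)

If some customers can be missing columns, you can use a lambda instead, e.g. reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), customer_dfs); that option is available since Spark 3.1, which is covered by the Spark 3.3 runtime in Glue 4.0.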