I have created a pandas UDF (df -> df) for a scenario, which takes care of the parallel run across the partitions found within the provided data - this part is fine. However, I was asked to add 20 more scenarios, each using a different set of fixed parameters to run the pUDF. I have adjusted the pUDF to be flexible on these input parameters, since each scenario has slightly different logic, but I currently run the scenarios in series and would like to run these 20 scenarios in parallel to shorten the processing time.
My question is: how should I parallelize over the input parameters in addition to the parallelism already inside the pUDF (i.e. nested parallelization)?
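For context, here is roughly what my current setup looks like for a single scenario. The names run_scenario, partition_col and out_schema are just illustrative placeholders for my actual function, grouping column and output schema, and df is my input Spark DataFrame:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

def run_scenario(pdf: pd.DataFrame, prv_end, mon_range) -> pd.DataFrame:
    # ... aggregate actual vs. previous periods, driven by the two parameters ...
    return pdf

# One scenario: Spark parallelizes over the partition groups only.
result = (
    df.groupBy("partition_col")
      .applyInPandas(
          lambda pdf: run_scenario(pdf, prv_end=-1, mon_range=None),
          schema=out_schema,
      )
)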
Example: imagine I have 5 different scenarios for different ways of comparing periods (month to previous month, month over the same month last year, quarter to quarter, etc.), as shown in the dictionary below. I need to pass the dict keys (i.e. their parameter sets) into the pUDF so it can process accordingly - meaning that for each scenario the actual vs. previous periods contain a different set of rows for the aggregation that happens inside the pUDF.
How would I pass these 5 dict keys (parameter sets) into the pUDF for one parallelized run instead of looping over them? (A simplified version of my current serial loop is shown after the dictionary.)
dic_param_compmethod = {
    "MtM": {'prv_end': -1,  'mon_range': None},
    "MoM": {'prv_end': -12, 'mon_range': 0},
    "QtQ": {'prv_end': -3,  'mon_range': -2},
    "QoQ": {'prv_end': -12, 'mon_range': -2},
    "YoY": {'prv_end': -12, 'mon_range': None}
}
Based on what I have found so far, I'm not sure what the appropriate way of approaching this is, although I believe a similar scenario must be daily bread for many. If this isn't the right way, how should it be accomplished? The only alternative I can think of is running separate jobs, each with its own cluster.
Many thanks for any guidance on this!
What I have found so far: I thought SparkContext.parallelize() could help me, but then I also found information that it cannot handle a nested parallelized function (Nesting parallelizations in Spark? What's the right approach?).
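Concretely, what I had imagined was something like the following (again using the placeholder names from above), which as I understand it cannot work because the SparkSession / DataFrame API is not available inside tasks that themselves run on the executors:

# This does NOT work: df and applyInPandas cannot be used inside a function
# that is itself executed as a Spark task on the executors (nested parallelism).
def run_one(item):
    name, params = item
    return (
        df.groupBy("partition_col")
          .applyInPandas(
              lambda pdf, p=params: run_scenario(pdf, **p),
              schema=out_schema,
          )
          .collect()
    )

rdd = spark.sparkContext.parallelize(list(dic_param_compmethod.items()))
all_results = rdd.map(run_one).collect()  # fails: Spark cannot nest jobs inside tasks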