Parallelize different scenarios for pandas UDF


I have created a pandas UDF (df -> df) for one scenario; it handles the parallel run across the partitions found within the provided data, and that part works fine. However, I was then asked to add 20 more scenarios, each using a different set of fixed parameters to run the pUDF. I have adjusted the pUDF to be flexible about these input parameters, since each scenario has slightly different logic, but I currently run the scenarios in series and would like to run all 20 in parallel to shorten the processing time.

My question is: how should I parallelize the input parameters along with the pUDF (i.e., nested parallelization)?

Example: imagine I have 5 different scenarios for different ways of comparison (Month to previous Month, Month over the same Month last year, Quarter..., etc.) as shown in the dictionary below. I need to pass the dict keys into the pUDF so it can process accordingly, meaning each time the actual vs. previous periods contain a different set of rows for the aggregation done inside the pUDF.

How would I pass these 5 dict keys (parameter sets) into the pUDF for one parallelized run instead of looping over them? (A simplified sketch of my current serial loop follows the dictionary.)

dic_param_compmethod = {
    "MtM": {"prv_end": -1,  "mon_range": None},
    "MoM": {"prv_end": -12, "mon_range": 0},
    "QtQ": {"prv_end": -3,  "mon_range": -2},
    "QoQ": {"prv_end": -12, "mon_range": -2},
    "YoY": {"prv_end": -12, "mon_range": None},
}
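
For reference, this is roughly what my current serial loop looks like. It is a heavily simplified toy version: the DataFrame, the column names (grp, month, value) and the aggregation inside the function are placeholders, not my real logic.

from pyspark.sql import SparkSession, functions as F
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the real data: one value per (group, month)
df = spark.createDataFrame(
    [("A", 1, 10.0), ("A", 2, 12.0), ("B", 1, 20.0), ("B", 2, 18.0)],
    ["grp", "month", "value"],
)

def make_scenario_fn(prv_end, mon_range):
    # returns a df -> df function closed over one scenario's parameters;
    # the real logic selects different actual/previous rows per scenario
    # (mon_range is unused in this toy example)
    def fn(pdf: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(
            {"grp": [pdf["grp"].iloc[0]],
             "agg": [float(pdf["value"].sum() + prv_end)]}
        )
    return fn

# current approach: one applyInPandas call per scenario, executed in series
results = []
for name, params in dic_param_compmethod.items():
    out = (
        df.groupBy("grp")
          .applyInPandas(make_scenario_fn(**params), schema="grp string, agg double")
          .withColumn("scenario", F.lit(name))
    )
    results.append(out)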

Based on what I have found so far, I'm not sure what the appropriate way to approach this is, although I believe a similar scenario must be everyday business for many people. If not this way, how should it be accomplished? The only alternative I can think of is running separate jobs, each on its own cluster.

Many thanks for any guidance on this!

So far found: I thought SparkContext.parallelize() could help, but I have also read that Spark cannot handle nested parallelized functions (Nesting parallelizations in Spark? What's the right approach?).
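
To make the question concrete: the only single-pass shape I can picture so far (an untested, hypothetical sketch reusing the toy names from above) is to flatten the nesting instead of nesting it, i.e. replicate the data once per scenario with a cross join and group by the scenario key, so that a single applyInPandas call covers all scenarios at once:

# small DataFrame of scenario parameters built from the dict
scen = spark.createDataFrame(
    [(k, v["prv_end"], v["mon_range"]) for k, v in dic_param_compmethod.items()],
    schema="scenario string, prv_end long, mon_range long",
)

# replicate every data row once per scenario; each row now carries its params
expanded = df.crossJoin(scen)

def scenario_fn(pdf: pd.DataFrame) -> pd.DataFrame:
    # within a group the scenario (and thus its parameters) is constant
    prv_end = int(pdf["prv_end"].iloc[0])
    return pd.DataFrame(
        {"scenario": [pdf["scenario"].iloc[0]],
         "grp": [pdf["grp"].iloc[0]],
         "agg": [float(pdf["value"].sum() + prv_end)]}
    )

result = (
    expanded.groupBy("scenario", "grp")
            .applyInPandas(scenario_fn, schema="scenario string, grp string, agg double")
)

With only 20 scenarios the replication seems cheap relative to the data size, but I have no idea whether this is the idiomatic approach, hence the question.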
