I have a PySpark job defined in a main() function.
It reads a DataFrame from HDFS and, for every row, calls the OpenAI API, feeding row['text'] as input and saving the response into a new column 'response'.
df = spark.read.format('avro').load(data_path)
for row in df.toLocalIterator():
    result = chain.run(row['text'])  # one sequential API call per row, on the driver
Now I essentially need 10 clients calling the OpenAI API concurrently. In other words, I need the work in main() to be spread across 10 workers/executors/processes/threads simultaneously.
(The PySpark job is hosted by Azkaban, i.e., it's executed inside an Azkaban flow.)
What is the best way to achieve this? Appreciate any help :)
df.toLocalIterator returns the rows to the driver, so even if you have N executors/slots, the rows are not processed by them. My suggestion is to wrap your call to OpenAI in a UDF. Then make sure your input DataFrame df has at least 10 partitions, so that each slot can process a partition.
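For illustration, here is a minimal sketch of that approach. It assumes the openai Python package (v1+) is installed on the executors and that OPENAI_API_KEY is set in their environment; call_openai and the model name are placeholders, so substitute your own prompt logic (e.g. your chain.run) inside the function.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

_client = None  # created lazily, so each executor process builds its own client

def call_openai(text):  # hypothetical helper; swap in your chain.run(text)
    global _client
    if _client is None:
        from openai import OpenAI
        _client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = _client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whichever model you need
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

openai_udf = F.udf(call_openai, StringType())

df = spark.read.format('avro').load(data_path)
df = df.repartition(10)  # at least 10 partitions so 10 slots can work in parallel
df = df.withColumn('response', openai_udf(F.col('text')))

Note that rows within a partition are still processed one at a time, so 10 partitions gives you roughly 10 concurrent API calls, provided your cluster has at least 10 free task slots (executors x cores per executor).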