So I have been trying to explore AWS Glue's parallelization features. This is the approach I have been following:
- Reading data from the Glue Data Catalog with `glueContext.create_dynamic_frame.from_catalog` into DynamicFrames
- Creating transformation functions to process these DynamicFrames
- Using ThreadPoolExecutor to run the transformations in parallel (code example below)
```python
with ThreadPoolExecutor(max_workers=5) as executor:
    output_dfs = executor.map(transform_func, dynamic_df)
```
The problem is that I get the following error:
TypeError: 'DynamicFrame' object is not iterable
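As far as I can tell, the error is not Glue-specific: `executor.map` iterates over its second argument to build one task per item, so passing any single non-iterable object fails the same way. A minimal reproduction with a plain Python class standing in for DynamicFrame (class and function names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeDynamicFrame:
    """Plain stand-in for a Glue DynamicFrame: a single, non-iterable object."""
    pass

def transform_func(frame):
    return frame

caught = None
with ThreadPoolExecutor(max_workers=5) as executor:
    try:
        # map() immediately iterates over its second argument to create
        # one task per item, so a non-iterable raises TypeError here
        executor.map(transform_func, FakeDynamicFrame())
    except TypeError as exc:
        caught = exc

print(type(caught).__name__)  # TypeError
```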
I also tried converting the DynamicFrame to a PySpark DataFrame using the .toDF() method (see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF) before passing it to executor.map. However, toDF() must be part of the transformation I want to parallelize: I found that running this method outside transform_func is far too slow, so I want to include it inside the body of transform_func.
My question is: how can I parallelize work over a DynamicFrame, i.e. pass it as an argument to executor.map()?
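For context, the shape I am trying to reach looks roughly like the sketch below, again with hypothetical plain-Python stand-ins (`FakeDynamicFrame`, `transform_func`) in place of the real Glue objects:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: in the real job these would be DynamicFrames
# read from the Data Catalog via create_dynamic_frame.from_catalog
class FakeDynamicFrame:
    def __init__(self, table):
        self.table = table

def transform_func(frame):
    # in the real job: frame.toDF() followed by the actual transformation
    return f"processed:{frame.table}"

frames = [FakeDynamicFrame(t) for t in ("orders", "customers", "payments")]

# one task per DynamicFrame; map() needs an iterable of inputs
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(transform_func, frames))

print(results)  # ['processed:orders', 'processed:customers', 'processed:payments']
```

The open question is how to get from a single DynamicFrame to an iterable of work items like `frames` above.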