Hello Stack Overflow community,
I'm facing a challenge while trying to run distributed training of a PyTorch neural network on AWS EMR Serverless, using TorchDistributor for distribution within Apache Spark.
For those unfamiliar with TorchDistributor in Spark, you can find more information in the official documentation.
The problem arises when I set local_mode=False during the initialization of the TorchDistributor, like this:

distributor = TorchDistributor(num_processes=num_processes,
                               local_mode=False,
                               use_gpu=False)
I have to set use_gpu=False because EMR Serverless does not support GPU instances.
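For context, training is launched through the distributor's run method. Here is a minimal sketch of that call; train_fn and its argument are placeholders standing in for my actual training function:

```python
# Placeholder training function; the real one builds the model, the
# DataLoader, and the torch.distributed process group.
def train_fn(learning_rate):
    # ... actual PyTorch training loop goes here ...
    return f"trained with lr={learning_rate}"

# With local_mode=False, run() is supposed to execute train_fn on the
# executors rather than on the driver:
# result = distributor.run(train_fn, 1e-3)
print(train_fn(1e-3))
```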
With this setup, my first issue was that barrier execution mode (which TorchDistributor relies on) currently does not work with spark.dynamicAllocation.enabled=true, so I had to disable dynamic allocation.
Now, the problem is that everything runs on the driver and no other executors come up, so training either takes too long or fails when dealing with a large amount of data.
When I submit the job to EMR Serverless, I specify the number of executors in my spark-submit parameters, but EMR Serverless appears to override it and always shows only the driver as an executor. I am unsure why this happens. Here's the job configuration where I set the number of executors:
sparkSubmitParameters:
  spark:
    driver:
      cores: "4"
      memory: "6g"
    executor:
      cores: "1"
      memory: "6g"
      instances: "10"
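For completeness, EMR Serverless ultimately receives these settings as a flat string of --conf flags in sparkSubmitParameters. Here is a sketch of the equivalent job driver I submit via boto3; the entryPoint path and the commented-out IDs are placeholders:

```python
# EMR Serverless expects sparkSubmitParameters as a single flat string
# of --conf flags, not nested YAML.
spark_submit_params = " ".join([
    "--conf spark.driver.cores=4",
    "--conf spark.driver.memory=6g",
    "--conf spark.executor.cores=1",
    "--conf spark.executor.memory=6g",
    "--conf spark.executor.instances=10",
    "--conf spark.dynamicAllocation.enabled=false",
])

job_driver = {
    "sparkSubmit": {
        "entryPoint": "s3://my-bucket/scripts/train.py",  # placeholder path
        "sparkSubmitParameters": spark_submit_params,
    }
}

# client = boto3.client("emr-serverless")
# client.start_job_run(applicationId="<app-id>",
#                      executionRoleArn="<role-arn>",
#                      jobDriver=job_driver)
print(job_driver["sparkSubmit"]["sparkSubmitParameters"])
```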
Additionally, here's how I configured my Spark session:
spark = SparkSession.builder.appName('torch-model-training') \
    .config("spark.driver.host", "0.0.0.0") \
    .config("spark.executor.instances", "10") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .getOrCreate()
I would greatly appreciate any insights or solutions regarding this issue. Thank you in advance for your help!