Hello Stack Overflow community,
I'm facing a challenge while trying to run distributed training of a PyTorch neural network on AWS EMR Serverless, using TorchDistributor for distribution within Apache Spark.
For those unfamiliar with TorchDistributor in Spark, you can find more information in the official documentation.
The problem arises when I set local_mode=False during the initialization of the TorchDistributor, like this:

distributor = TorchDistributor(num_processes=num_processes,
                               local_mode=False,
                               use_gpu=False)
I have to set use_gpu=False because EMR Serverless does not support GPU instances.
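For context, training is launched through the distributor's run method. Here is a minimal sketch of that call; train_fn and its argument are placeholders standing in for my actual training function:

```python
# Placeholder training function; the real one builds the model, the
# DataLoader, and the torch.distributed process group.
def train_fn(learning_rate):
    # ... actual PyTorch training loop goes here ...
    return f"trained with lr={learning_rate}"

# With local_mode=False, run() is supposed to execute train_fn on the
# executors rather than on the driver:
# result = distributor.run(train_fn, 1e-3)
print(train_fn(1e-3))
```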
With this setup, my first issue was that barrier execution mode (which TorchDistributor relies on) currently does not work with spark.dynamicAllocation.enabled=true, so I had to disable dynamic allocation.
Now, the problem is that everything runs on the driver and no other executors come up, so training either takes too long or fails when dealing with a large amount of data.
When I submit the job to EMR Serverless, I specify the number of executors in my spark-submit parameters, but EMR Serverless appears to override it and always shows only the driver as an executor. I am unsure why this happens. Here's the job configuration where I set the number of executors:
sparkSubmitParameters:
  spark:
    driver:
      cores: "4"
      memory: "6g"
    executor:
      cores: "1"
      memory: "6g"
      instances: "10"
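For completeness, EMR Serverless ultimately receives these settings as a flat string of --conf flags in sparkSubmitParameters. Here is a sketch of the equivalent job driver I submit via boto3; the entryPoint path and the commented-out IDs are placeholders:

```python
# EMR Serverless expects sparkSubmitParameters as a single flat string
# of --conf flags, not nested YAML.
spark_submit_params = " ".join([
    "--conf spark.driver.cores=4",
    "--conf spark.driver.memory=6g",
    "--conf spark.executor.cores=1",
    "--conf spark.executor.memory=6g",
    "--conf spark.executor.instances=10",
    "--conf spark.dynamicAllocation.enabled=false",
])

job_driver = {
    "sparkSubmit": {
        "entryPoint": "s3://my-bucket/scripts/train.py",  # placeholder path
        "sparkSubmitParameters": spark_submit_params,
    }
}

# client = boto3.client("emr-serverless")
# client.start_job_run(applicationId="<app-id>",
#                      executionRoleArn="<role-arn>",
#                      jobDriver=job_driver)
print(job_driver["sparkSubmit"]["sparkSubmitParameters"])
```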
Additionally, here's how I configured my Spark session:
spark = SparkSession.builder.appName('torch-model-training') \
    .config("spark.driver.host", "0.0.0.0") \
    .config("spark.executor.instances", "10") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .getOrCreate()
I would greatly appreciate any insights or solutions regarding this issue. Thank you in advance for your help!