I am receiving the following error while running my EMR Serverless PySpark SQL code:
ERROR:root:An error occurred while calling o221.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 53.0 failed 1 times, most recent failure: Lost task 0.0 in stage 53.0 (TID 176) (ip-10-1-20-165.ec2.internal executor driver): TaskResultLost (result lost from block manager)
I don't see this issue when I run outside a VPC, but I do when I run in a VPC. When I run in a VPC with a small number of rows (< 10k), I don't receive the error either.
I use Spark SQL for some operations as well as DataFrame functions. I partition the data with: dfp = df.repartition(200, "vehicle_id")
I start with the following configuration, although EMR Serverless should scale:
InitialCapacity:
  - Key: DRIVER
    Value:
      WorkerCount: 2
      WorkerConfiguration:
        Cpu: 16vCPU
        Memory: "64GB"
        Disk: "200GB"
  - Key: EXECUTOR
    Value:
      WorkerCount: 5
      WorkerConfiguration:
        Cpu: 16vCPU
        Memory: "64GB"
        Disk: "200GB"
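In case it matters, this is the shape of the StartJobRun driver I could pass Spark confs through (the values and the entry point below are placeholders, not my current settings):

```yaml
# Placeholder values -- just the shape of the job driver, not my actual settings
jobDriver:
  sparkSubmit:
    entryPoint: "s3://my-bucket/job.py"   # placeholder path
    sparkSubmitParameters: >-
      --conf spark.driver.maxResultSize=4g
      --conf spark.network.timeout=600s
```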
I'm expecting this code to work; I've run the same code previously in a provisioned EMR container using the same VPC without issues.