"Connection reset" error while running PySpark on 150 million rows of data


The code below gives me the error. I have searched for a solution, but to no avail.

from pyspark.sql import functions as func

# Keep only rows with no billing error
volume_filter_df = result_df.filter(result_df.BILLINGERRORCODE == '0')

# Keep only rows where 'CATEGORY' is not null
filtered_df = volume_filter_df.filter(volume_filter_df['CATEGORY'].isNotNull())

# Count transactions per category
volume_analysis = filtered_df.groupBy('CATEGORY').agg(func.count('TRANSACTIONTYPE').alias('Count'))
volume_analysis.show()

This is the error I get.

Py4JJavaError: An error occurred while calling o189.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 25.0 failed 1 times, most recent failure: Lost task 5.0 in stage 25.0 (TID 46) (LT-IT-263.C.COM executor driver): java.net.SocketException: Connection reset

I tried the solutions below, but they didn't work.

import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
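From what I have read, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are only read when the Python workers are launched, so setting them may only take effect if done before the SparkSession is created. A rough sketch of what that ordering would look like (the app name and memory value are placeholders, not my real config):

import os
import sys

# Point the driver and the workers at the same Python interpreter
# *before* the session exists, since these variables are read at worker launch
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("VolumeAnalysis")            # placeholder app name
    .config("spark.driver.memory", "8g")  # placeholder; adjust for the data size
    .getOrCreate()
)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

Would this ordering make a difference here, or is the "Connection reset" caused by something else entirely?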