Hanging Task in Databricks


I am applying a pandas UDF to a grouped DataFrame in Databricks. When I do this, a couple of tasks hang indefinitely while the rest complete quickly.

I start by repartitioning my dataset so that each group is in one partition:

group_factors = ['a','b','c'] #masked for anonymity

model_df = (
    df
    .repartition(
        num_cores, #partition into max number of cores on this compute
        group_factors #partition by group so a group is always in same partition
        )
    )
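
(For clarity, num_cores is just the number of cores available on the compute. The line below is only a sketch of one way it could be derived, using sparkContext.defaultParallelism; the exact mechanism isn't important to the question.)

num_cores = spark.sparkContext.defaultParallelism #sketch: defaultParallelism typically equals the cluster's total core count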

I then group my dataset and apply the UDF:

results = (
    model_df #use repartitioned data
    .groupBy(group_factors) #build groups
    .applyInPandas(udf_tune, schema=result_schema) #apply in parallel
    )

#write results table to store parameters
results.write.mode('overwrite').saveAsTable(table_name)
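
(The real udf_tune and result_schema are too long to post. They follow the general shape sketched below; the column names and the tuning logic here are placeholders, not my actual code.)

import pandas as pd
from pyspark.sql import types as T

#placeholder schema: the group keys plus the tuned parameter(s)
result_schema = T.StructType([
    T.StructField('a', T.StringType()),
    T.StructField('b', T.StringType()),
    T.StructField('c', T.StringType()),
    T.StructField('best_param', T.DoubleType()),
    ])

def udf_tune(pdf: pd.DataFrame) -> pd.DataFrame:
    #receives all rows for one (a, b, c) group as a pandas DataFrame,
    #tunes a model on that group, and returns one row of results
    best_param = 0.0 #stand-in for the real tuning logic
    return pdf[['a', 'b', 'c']].head(1).assign(best_param=best_param)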

Spark then splits this into a number of tasks equal to the number of partitions. It runs successfully for all but two tasks. Those two tasks do not throw errors, but instead hang until the job's timeout threshold is reached.

Spark UI for compute cluster

What is strange is that these groups/tasks do not appear to have any irregularities. Their record size is similar to that of the other 58 completed tasks. The code does not throw any errors, so we don't have incorrectly typed or formatted data. Further, this command actually completes successfully about 20% of the time, but most days we get caught on one or two hanging tasks that cause the job to fail.
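
(For reference, the record-size comparison can be made with a simple per-group count; the sketch below is an illustration, not my exact check.)

#sketch of a per-group size check to compare record counts across tasks
group_sizes = (
    model_df
    .groupBy(group_factors)
    .count()
    .orderBy('count', ascending=False)
    )
group_sizes.show(60) #one row per group/task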

The stderr simply notes that the task is hanging:

stderr for hanging task

The stdout notes an allocation error (although all completed tasks contain the same allocation failure in their stdout files):

stdout for hanging task

Any suggestions for how to avoid the hanging task issue?

When I reduce the data size (for example, by splitting model_df into 4 smaller subsets, grouping and applying the UDF to each subset, and appending the results), I do not run into this issue.
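
(That workaround looks roughly like the sketch below. Splitting into 4 buckets via a hash of the group keys is illustrative; the key point is that each group stays whole within a single subset.)

from pyspark.sql import functions as F

#sketch of the workaround: process the groups in 4 smaller batches and append results
split_count = 4
bucketed = model_df.withColumn('bucket', F.abs(F.hash(*group_factors)) % split_count)

for i in range(split_count):
    subset = bucketed.filter(F.col('bucket') == i).drop('bucket')
    subset_results = (
        subset
        .repartition(num_cores, group_factors)
        .groupBy(group_factors)
        .applyInPandas(udf_tune, schema=result_schema)
        )
    subset_results.write.mode('overwrite' if i == 0 else 'append').saveAsTable(table_name)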
