Parallelization of ml_linear_regression in sparklyr


Using sparklyr and dplyr, I am working with a dataframe that has the following structure:

lab   | x | y
------|---|---
a     | 1 | 1
a     | 2 | 2 
a     | 3 | 3
b     | 1 | 2
b     | 2 | 4 
b     | 3 | 6

What I need is a linear regression model trained for every partition of lab. Because the original dataframe is quite big, I cannot afford a loop that iterates over the partitions and fits all the regressions serially. I therefore tried using the spark_apply function, as follows:

library(sparklyr)
library(dplyr)

# sc is an existing Spark connection,
# e.g. sc <- spark_connect(master = "local")

df = data.frame("x" = c(1, 2, 3, 1, 2, 3),
                "y" = c(1, 2, 3, 2, 4, 6),
                "lab" = c("a", "a", "a", "b", "b", "b"))

sdf_df = sdf_copy_to(sc, df, overwrite = TRUE)

# Fit a regression on one partition and return its predictions
fit_part = function(df){
  model = ml_linear_regression(df, y ~ x)
  result = ml_predict(model, select(df, "x"))
  return(result)
}

spark_apply(sdf_df, fit_part, group_by = "lab")

This produces an obscure error beginning with:

Error: org.apache.spark.SparkException: Job aborted due to stage failure

Is it even possible to use spark_apply this way? If not, how would you go about solving this problem? Thank you!
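For context, here is the variant I was considering next, in case my diagnosis helps. My (untested) assumption is that spark_apply runs the supplied function as plain R on the workers, handing it each group as an ordinary R data.frame with no Spark connection, so ml_linear_regression would not be usable there and base lm() would have to stand in for it:

# Sketch only, under the assumption stated above: fit with base lm()
# inside the worker; spark_apply adds the group_by column back to the output.
fit_part_local = function(df){
  model = lm(y ~ x, data = df)                      # no Spark connection needed
  data.frame(x = df$x, prediction = predict(model)) # fitted values per group
}

spark_apply(sdf_df, fit_part_local, group_by = "lab")

If that assumption about the worker environment is right, it would also explain the stage failure above, but I would still prefer an approach that uses Spark's own ML routines if one exists.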
