UDFs in Spark pipelines

I create a UDF in Python that computes an array of dates between two date columns in a table, and I register it with the Spark session. I use this UDF in a pipeline to compute a new column.
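
For context, a minimal sketch of how such a UDF might look and be registered (the function and column names here are placeholders, not the real ones):

```python
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DateType

spark = SparkSession.builder.getOrCreate()

def dates_between(start, end):
    """Return every date from start to end, inclusive."""
    if start is None or end is None:
        return None
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Register it under a name so SQL expressions (and SQLTransformer stages) can call it.
spark.udf.register("dates_between", dates_between, ArrayType(DateType()))
```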

Now, when I save this pipeline to HDFS and expect it to be read back for execution in a different program (with a different Spark session), the UDF is not available because it is not globally registered anywhere. Since the process is generic and needs to run multiple pipelines, I don't want to add the UDF definition and register it with the Spark session there.
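
A sketch of the failing round trip, continuing the example above and assuming the UDF is called from a SQLTransformer stage (the stage type and paths are assumptions on my part):

```python
from datetime import date

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([(date(2024, 1, 1), date(2024, 1, 5))], ["start_dt", "end_dt"])

# Program A: dates_between is registered in this session, so fitting and saving works.
stage = SQLTransformer(
    statement="SELECT *, dates_between(start_dt, end_dt) AS date_range FROM __THIS__")
model = Pipeline(stages=[stage]).fit(df)
model.write().overwrite().save("hdfs:///models/date_range_pipeline")

# Program B (different Spark session): loading succeeds, but transform() fails
# because dates_between is not registered in that session.
model = PipelineModel.load("hdfs:///models/date_range_pipeline")
model.transform(df).show()
```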

Is there any way for me to register UDFs globally, across all Spark sessions?

Can I add it as a dependency somehow, in a neat, maintainable manner?

There is 1 answer below.

I have the same problem, trying to save a pipeline from Python and import it in Scala.

  • I think I will just use SQL to do what I want to do (a SQL-only sketch follows after this list).

  • I also saw that I could use a Python .py file in Scala, but I have not yet found a way to use it in a UDF transformer.

  • If you want to go the other way and use a jar-packaged (Java/Scala) UDF from PySpark, I think you can register the jar of the UDF using sql_context.udf.registerJavaFunction (or sql_context.sql("CREATE FUNCTION newF AS 'f' USING JAR 'udf.jar'")). This seemed to work for me, but it does not help in my case since I need to go Python => Scala. A sketch of this follows below.
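
For the SQL-only route: if the goal is just the array of dates, Spark 2.4+ has a built-in sequence function, so no UDF is needed at all (column names are placeholders):

```python
# Built-in SQL function; because it is built in, a SQLTransformer using this
# expression serializes and reloads in another session without any registration.
df.selectExpr("*", "sequence(start_dt, end_dt, interval 1 day) AS date_range")
```

And a sketch of the jar-based registration as it looks from PySpark; the class name and jar path are placeholders, and the metastore note is my understanding rather than something from the question:

```python
from pyspark.sql.types import ArrayType, DateType

# Option 1: a UDF implementing Spark's Java UDF interface (UDF1/UDF2/...), packaged
# in a jar that is already on the classpath.
spark.udf.registerJavaFunction(
    "dates_between", "com.example.DatesBetweenUDF", ArrayType(DateType()))

# Option 2: register via SQL from a jar. This path goes through the Hive function
# registry, so the class is typically a Hive-style UDF; without TEMPORARY the
# function is persisted in the metastore and is visible to any session sharing it.
spark.sql("CREATE FUNCTION newF AS 'com.example.MyHiveUDF' USING JAR 'hdfs:///jars/udf.jar'")
```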