I created a UDF in Python that computes an array of dates between two date columns in a table, and registered it with the Spark session. I use this UDF in a pipeline to compute a new column.
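Roughly, the UDF looks like this (a simplified sketch; the function and column names are illustrative, not my exact code):

```python
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DateType

spark = SparkSession.builder.getOrCreate()

def date_range(start, end):
    """Return every date from start to end, inclusive."""
    if start is None or end is None:
        return None
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Registered against this SparkSession only -- this registration is exactly
# what is missing when the saved pipeline is loaded by another program.
spark.udf.register("date_range", udf(date_range, ArrayType(DateType())))
```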
Now, when I save this pipeline to HDFS and expect it to be read back and executed in a different program (with a different Spark session), the UDF is not available, because it is not globally registered anywhere. Since that process is generic and needs to run multiple pipelines, I don't want to add the UDF definition there and register it with that Spark session.
Is there any way for me to register UDFs globally, across all Spark sessions?
Can I add this in as a dependency somehow, in a neat, maintainable manner?
I have the same problem trying to save a pipeline from Python and import it in Scala.
I think I will just use SQL to do what I want to do.
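One way this could work, assuming Spark 2.4+ (for the built-in `sequence` function) and illustrative column names `start_date` / `end_date`: put the SQL into a `SQLTransformer` stage, which serializes with the pipeline, so the loading program needs no extra registration.

```python
from pyspark.ml.feature import SQLTransformer

# The date-array logic is stored inside the pipeline as plain SQL; no
# session-local Python UDF has to travel with it.
date_array_stage = SQLTransformer(
    statement="SELECT *, sequence(start_date, end_date, interval 1 day) "
              "AS date_array FROM __THIS__"
)
```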
I also saw that I could use a Python .py file from Scala, but I have not yet found a way to use it in a UDF transformer.
If you want to use PySpark from a Java pipeline, I think you can use the jar of the UDF via `sql_context.udf.registerJavaFunction(...)` or `sql_context.sql("CREATE FUNCTION newF AS 'f' USING JAR 'udf.jar'")`. This seemed to work for me, but it doesn't help in my case, since I need to go Python => Scala.
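For anyone going the jar route, a rough sketch of both variants (the class name and jar path are placeholders, and the `CREATE FUNCTION` variant needs a session with Hive support):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DateType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Variant 1: register a UDF implemented in Java/Scala. The jar must already
# be on the session's classpath (e.g. via spark-submit --jars).
spark.udf.registerJavaFunction(
    "newF", "com.example.DateRangeUDF", ArrayType(DateType())
)

# Variant 2: Hive-style function definition that points at a jar on HDFS,
# so any session with Hive support can resolve it by name.
spark.sql(
    "CREATE FUNCTION newF AS 'com.example.DateRangeUDF' "
    "USING JAR 'hdfs:///path/to/udf.jar'"
)
```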