UDFs in Spark pipelines

I create a UDF in Python that computes an array of dates between two date columns in a table, and I register it with the Spark session. I use this UDF in a pipeline to compute a new column.
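
For context, a minimal sketch of how such a UDF might look and be registered (the function and column names here are placeholders, not the real ones):

```python
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DateType

spark = SparkSession.builder.getOrCreate()

def dates_between(start, end):
    """Return every date from start to end, inclusive."""
    if start is None or end is None:
        return None
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Register it under a name so SQL expressions (and SQLTransformer stages) can call it.
spark.udf.register("dates_between", dates_between, ArrayType(DateType()))
```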

Now, when I save this pipeline to HDFS and expect it to be read back for execution in a different program (with a different Spark session), the UDF is not available because it is not globally registered anywhere. Since the process is generic and needs to run multiple pipelines, I don't want to add the UDF definition and register it with the Spark session there.
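
A sketch of the failing round trip, continuing the example above and assuming the UDF is called from a SQLTransformer stage (the stage type and paths are assumptions on my part):

```python
from datetime import date

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([(date(2024, 1, 1), date(2024, 1, 5))], ["start_dt", "end_dt"])

# Program A: dates_between is registered in this session, so fitting and saving works.
stage = SQLTransformer(
    statement="SELECT *, dates_between(start_dt, end_dt) AS date_range FROM __THIS__")
model = Pipeline(stages=[stage]).fit(df)
model.write().overwrite().save("hdfs:///models/date_range_pipeline")

# Program B (different Spark session): loading succeeds, but transform() fails
# because dates_between is not registered in that session.
model = PipelineModel.load("hdfs:///models/date_range_pipeline")
model.transform(df).show()
```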

Is there any way for me to register UDFs globally, across all Spark sessions?

Can I add it as a dependency somehow, in a neat, maintainable manner?

There is 1 answer below.

I have the same problem, trying to save a pipeline from Python and import it in Scala.

  • I think I will just use SQL to do what I want to do (a SQL-only sketch follows after this list).

  • I also saw that I could use a Python .py file in Scala, but I have not yet found a way to use it in a UDF transformer.

  • If you want to go the other way and use a jar-packaged (Java/Scala) UDF from PySpark, I think you can register the jar of the UDF using sql_context.udf.registerJavaFunction (or sql_context.sql("CREATE FUNCTION newF AS 'f' USING JAR 'udf.jar'")). This seemed to work for me, but it does not help in my case since I need to go Python => Scala. A sketch of this follows below.
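
For the SQL-only route: if the goal is just the array of dates, Spark 2.4+ has a built-in sequence function, so no UDF is needed at all (column names are placeholders):

```python
# Built-in SQL function; because it is built in, a SQLTransformer using this
# expression serializes and reloads in another session without any registration.
df.selectExpr("*", "sequence(start_dt, end_dt, interval 1 day) AS date_range")
```

And a sketch of the jar-based registration as it looks from PySpark; the class name and jar path are placeholders, and the metastore note is my understanding rather than something from the question:

```python
from pyspark.sql.types import ArrayType, DateType

# Option 1: a UDF implementing Spark's Java UDF interface (UDF1/UDF2/...), packaged
# in a jar that is already on the classpath.
spark.udf.registerJavaFunction(
    "dates_between", "com.example.DatesBetweenUDF", ArrayType(DateType()))

# Option 2: register via SQL from a jar. This path goes through the Hive function
# registry, so the class is typically a Hive-style UDF; without TEMPORARY the
# function is persisted in the metastore and is visible to any session sharing it.
spark.sql("CREATE FUNCTION newF AS 'com.example.MyHiveUDF' USING JAR 'hdfs:///jars/udf.jar'")
```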