Which optimizations do UDFs not benefit from?

Spark UDFs expose functions such as `nullable`, `deterministic`, and `dataType`. Given that information, a UDF should benefit from optimizations such as ConstantFolding. Which other optimizations do UDFs benefit from, and which can they not benefit from? I ask because many presentations portray UDFs as a black box that gets no Catalyst optimizations, yet they clearly benefit from ConstantFolding.
Spark handles UDFs by wrapping them inside a class. For example, when you write the following:
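A minimal sketch (the `plusOne` name and lambda body are illustrative):

```scala
import org.apache.spark.sql.functions.udf

// Wrap an ordinary Scala lambda as a Spark UDF.
val plusOne = udf((x: Int) => x + 1)
```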
What the `udf` function does is create a `UserDefinedFunction`, whose `apply` method creates a `ScalaUDF`. `ScalaUDF` extends `Expression`, and its `doGenCode` method does the following:
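A paraphrase of that conversion step, using Spark's `CatalystTypeConverters` directly (an illustration, not the verbatim `ScalaUDF` source):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.StringType

// Catalyst stores strings internally as UTF8String, so the value must be
// converted to a java.lang.String before the user's lambda sees it, and
// the lambda's result must be converted back afterwards.
val toScala    = CatalystTypeConverters.createToScalaConverter(StringType)
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(StringType)

val userLambda = (s: String) => s.toUpperCase   // stands in for the wrapped UDF
val internal   = toCatalyst("hello")            // internal UTF8String form
val result     = toCatalyst(userLambda(toScala(internal).asInstanceOf[String]))
```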
In essence, it converts the `DataType` of the column/expression to a Scala type (because your UDF operates on Scala types) and then calls your lambda. The `deterministic`, `nullable`, and `dataType` members are defined on the wrapper, which extends `Expression`, not on your function. If you want to fully benefit from them, you would have to write a custom expression that extends `Expression` or one of its subclasses. Take the following as an example:
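A sketch of a filter whose predicate is trivially true but hidden inside a UDF (names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A predicate that always returns true, but Catalyst cannot see inside it.
val alwaysTrue = udf((x: Int) => true)

val df = Seq(1, 2, 3).toDF("n").filter(alwaysTrue($"n"))
df.explain(true)
```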
The optimized logical plan would look something like this (exact output varies by Spark version, and the exprId `n#1` is arbitrary):
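```
== Optimized Logical Plan ==
Filter UDF(n#1)
+- LocalRelation [n#1]
```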
As you can see, Spark still executes the filter even though it is redundant and will always evaluate to true; Catalyst cannot prove that, because it cannot look inside the UDF's closure.
Whereas the following:
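The same predicate expressed with built-in expressions that Catalyst can reason about (continuing the session above):

```scala
import org.apache.spark.sql.functions.lit

// lit(true) is a Catalyst literal, so the optimizer can see it is constant.
val df2 = Seq(1, 2, 3).toDF("n").filter(lit(true))
df2.explain(true)
```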
would give the following optimized logical plan (again, approximate output):
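```
== Optimized Logical Plan ==
LocalRelation [n#1]
```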
Catalyst prunes the filter out entirely using the `PruneFilters` rule.
This doesn't mean all optimizations are excluded. Some still work with UDFs, such as `CombineFilters`, which merges the predicates of two adjacent filters. For example:
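A sketch with two chained UDF filters (UDF names are illustrative; continuing the session above):

```scala
// CombineFilters can merge these two Filter nodes into one, because both
// UDFs report deterministic = true.
val isPositive = udf((x: Int) => x > 0)
val isEven     = udf((x: Int) => x % 2 == 0)

val combined = Seq(1, 2, 3).toDF("n")
  .filter(isPositive($"n"))
  .filter(isEven($"n"))
combined.explain(true)  // one Filter with both UDF calls ANDed together
```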
This optimization works because it depends only on the `deterministic` field, and UDFs are deterministic by default. So UDFs benefit from the simple optimizations that don't depend on the function they wrap, and miss out on the rest because the closure is in a format Catalyst doesn't understand: Catalyst operates on expression trees, while your closure is an opaque Scala function. UDFs also lose out elsewhere, for example in the ability to specify the generated Java code and to give Spark precise type information.