My Spark project (written in Java) needs to access different tables (the results of SELECT queries) from the executors.
One solution to this problem is:
- create a tempView
- select the required columns
- using `forEach`, convert the `DataFrame` to a `Map`
- pass that map as a broadcast variable across executors.
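A rough sketch of these steps, assuming a two-column string lookup. The class name, the `countries` view, the sample rows, and the helper methods are all hypothetical and only there to make the example self-contained:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BroadcastMapSketch {

    // Hypothetical sample table standing in for a real source table.
    static Dataset<Row> sampleCountries(SparkSession spark) {
        StructType schema = new StructType()
                .add("code", DataTypes.StringType)
                .add("name", DataTypes.StringType);
        return spark.createDataFrame(Arrays.asList(
                RowFactory.create("DE", "Germany"),
                RowFactory.create("FR", "France")), schema);
    }

    // Collects a two-column result on the driver and copies it into a Map.
    static Map<String, String> toMap(Dataset<Row> df) {
        Map<String, String> map = new HashMap<>();
        for (Row row : df.collectAsList()) {
            map.put(row.getString(0), row.getString(1));
        }
        return map;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-map-sketch")
                .master("local[*]")
                .getOrCreate();

        // Step 1: register the table as a temp view.
        sampleCountries(spark).createOrReplaceTempView("countries");

        // Step 2: select only the required columns.
        Dataset<Row> selected = spark.sql("SELECT code, name FROM countries");

        // Step 3: convert the result to a Map on the driver.
        Map<String, String> lookup = toMap(selected);

        // Step 4: broadcast the Map so every executor gets a read-only copy.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<Map<String, String>> bc = jsc.broadcast(lookup);

        // Executors read the broadcast value inside a transformation.
        Dataset<String> codes = spark.createDataset(Arrays.asList("DE", "FR"), Encoders.STRING());
        Dataset<String> names = codes.map(
                (MapFunction<String, String>) code -> bc.value().get(code),
                Encoders.STRING());
        names.show();

        spark.stop();
    }
}
```

Note that `collectAsList()` pulls the whole result onto the driver before it is broadcast, so the driver's memory bounds how large the map can be.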
However, I have found that:
- there are many complex queries whose results can't be stored directly in a `Map`
- the tables are very large, so creating a large `Map` and passing it to the executors as a broadcast variable doesn't sound efficient.
Instead, can we load the tables in memory using `load` so that they can be shared across executors?
Is the method `void org.apache.spark.sql.Dataset.createOrReplaceTempView(String viewName)` or `void org.apache.spark.sql.Dataset.createGlobalTempView(String viewName) throws AnalysisException` useful for this purpose?
Spark version: 2.3.0
You can broadcast a DataFrame. See the documentation.
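One possible reading of this suggestion is the broadcast hint `org.apache.spark.sql.functions.broadcast`, which marks a DataFrame to be shipped to every executor for a join. Below is a minimal sketch under that assumption. The class name, the `orders` and `countries` data, and the `joinWithLookup` helper are hypothetical:

```java
import static org.apache.spark.sql.functions.broadcast;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BroadcastJoinSketch {

    // Hypothetical "large" side of the join.
    static Dataset<Row> sampleOrders(SparkSession spark) {
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("code", DataTypes.StringType);
        return spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, "DE"),
                RowFactory.create(2, "FR")), schema);
    }

    // Hypothetical "small" side of the join, the one that gets broadcast.
    static Dataset<Row> sampleCountries(SparkSession spark) {
        StructType schema = new StructType()
                .add("code", DataTypes.StringType)
                .add("name", DataTypes.StringType);
        return spark.createDataFrame(Arrays.asList(
                RowFactory.create("DE", "Germany"),
                RowFactory.create("FR", "France")), schema);
    }

    // Joins on the shared "code" column, hinting Spark to broadcast the small side.
    static Dataset<Row> joinWithLookup(Dataset<Row> large, Dataset<Row> small) {
        return large.join(broadcast(small), "code");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-join-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> joined = joinWithLookup(sampleOrders(spark), sampleCountries(spark));
        joined.show();

        spark.stop();
    }
}
```

The hint keeps the lookup data as a DataFrame, so no conversion to a `Map` is needed. The broadcast side is still copied to every executor, so this only suits a table small enough to fit in memory.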