I would like to convert a small dataframe into a broadcast lookup table to be used inside the UDF of another, larger dataframe. The smaller dataframe (myLookupDf) looks something like this:
+---+---+---+---+
| x | 90|100|101|
+---+---+---+---+
| 90| 1| 0| 0|
|100| 0| 1| 1|
|101| 0| 1| 1|
+---+---+---+---+
I want to use the first column as the first key, say x1, and the first row (the header) as the second key, say x2. x1 and x2 have the same elements. Ideally, the lookup table (myLookupMap) will be a Scala Map (or similar) and work like:
myLookupMap(90)(90) returns 1
myLookupMap(90)(101) returns 0
myLookupMap(100)(90) returns 0
myLookupMap(101)(100) returns 1
etc.
So far, I have managed to get:
val myLookupMap = myLookupDf.collect().map(r => Map(myLookupDf.columns.zip(r.toSeq):_*))
myLookupMap: Array[scala.collection.Map[String,Any]] = Array(Map(x -> 90, 90 -> 1, 100 -> 0, 101 -> 0), Map(x -> 100, 90 -> 0, 100 -> 1, 101 -> 1), Map(x -> 101, 90 -> 0, 100 -> 1, 101 -> 1))
which is an Array of Map and not exactly what is required. Any suggestions are much appreciated.
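collect() on a dataframe always returns an Array (here an Array[Row]), so you have to find a way to turn the collected array into a Map.
Given the dataframe in the question, all you need are the header names other than x. Zip those names (converted to Int) with the remaining values of each row to build the inner Map, key that inner Map by the row's x value, and call toMap on the resulting array of pairs. This is just a modification of your map function so that the result is a Map rather than an Array of Maps. You should then get the desired results: myLookupMap(90)(90) returns 1, myLookupMap(90)(101) returns 0, and so on.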
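The construction described above can be sketched in plain Scala as follows. This is only an illustration, not the original answer's code: the object name LookupMapSketch and the helper buildLookupMap are hypothetical, and the hard-coded columns and rows stand in for myLookupDf.columns and myLookupDf.collect().map(_.toSeq). It also assumes that x is the first column and that every header and cell parses as an Int.

```scala
object LookupMapSketch {

  // Hypothetical helper: builds Map(firstKey -> Map(secondKey -> value)).
  // Assumes the first column holds the first key and the remaining
  // header names are the second keys.
  def buildLookupMap(columns: Array[String],
                     rows: Array[Seq[Any]]): Map[Int, Map[Int, Int]] = {
    // Header names other than the first column, used as the second key.
    val secondKeys: Array[Int] = columns.tail.map(_.toInt)

    rows.map { r =>
      val firstKey = r.head.toString.toInt       // value of the x column
      val values = r.tail.map(_.toString.toInt)  // remaining cells of the row
      firstKey -> secondKeys.zip(values).toMap   // one (key -> inner Map) pair
    }.toMap                                      // array of pairs becomes a Map
  }

  // Stand-ins for myLookupDf.columns and myLookupDf.collect().map(_.toSeq).
  val columns: Array[String] = Array("x", "90", "100", "101")
  val rows: Array[Seq[Any]] = Array(
    Seq(90, 1, 0, 0),
    Seq(100, 0, 1, 1),
    Seq(101, 0, 1, 1)
  )

  val myLookupMap: Map[Int, Map[Int, Int]] = buildLookupMap(columns, rows)
}
```

With a real dataframe, the same helper could be called as buildLookupMap(myLookupDf.columns, myLookupDf.collect().map(_.toSeq)), provided the assumptions above hold for its schema.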
Now you can pass myLookupMap to your udf function.
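A minimal sketch of that last step, assuming a local SparkSession and a larger dataframe with Int columns named x1 and x2 (both names, like largeDf, lookupBc and lookupUdf, are made up for illustration). The map is broadcast once and read inside the UDF through .value. Returning an Option lets a missing key pair come back as null instead of throwing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-lookup-sketch")
      .getOrCreate()
    import spark.implicits._

    // The nested map built from myLookupDf (hard-coded here).
    val myLookupMap: Map[Int, Map[Int, Int]] = Map(
      90  -> Map(90 -> 1, 100 -> 0, 101 -> 0),
      100 -> Map(90 -> 0, 100 -> 1, 101 -> 1),
      101 -> Map(90 -> 0, 100 -> 1, 101 -> 1)
    )

    // Ship the map to the executors once.
    val lookupBc = spark.sparkContext.broadcast(myLookupMap)

    // Look up (x1, x2); None becomes null in the resulting column.
    val lookupUdf = udf { (x1: Int, x2: Int) =>
      lookupBc.value.get(x1).flatMap(_.get(x2))
    }

    // Hypothetical larger dataframe with the two key columns.
    val largeDf = Seq((90, 90), (90, 101), (100, 90), (101, 100)).toDF("x1", "x2")

    largeDf.withColumn("lookup", lookupUdf(col("x1"), col("x2"))).show()

    spark.stop()
  }
}
```

For a lookup table this small, simply closing over myLookupMap in the UDF would also work. An explicit broadcast mainly helps when the map is large or reused across several UDFs.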