I would like to convert a small dataframe into a broadcast lookup table to be used inside a UDF applied to another, larger dataframe. The small dataframe (myLookupDf) looks something like this:
+---+---+---+---+
| x | 90|100|101|
+---+---+---+---+
| 90| 1| 0| 0|
|100| 0| 1| 1|
|101| 0| 1| 1|
+---+---+---+---+
I want to use the first column as the first key, say x1, and the header row as the second key, say x2. x1 and x2 take the same set of values. Ideally, the lookup table (myLookupMap) will be a Scala Map (or similar) and work like:
myLookupMap(90)(90) returns 1
myLookupMap(90)(101) returns 0
myLookupMap(100)(90) returns 0
myLookupMap(101)(100) returns 1
etc.
So far, I have managed to get:
val myLookupMap = myLookupDf.collect().map(r => Map(myLookupDf.columns.zip(r.toSeq):_*))
myLookupMap: Array[scala.collection.Map[String,Any]] = Array(Map(x -> 90, 90 -> 1, 100 -> 0, 101 -> 0), Map(x -> 100, 90 -> 0, 100 -> 1, 101 -> 1), Map(x -> 101, 90 -> 0, 100 -> 1, 101 -> 1))
which is an Array of Maps rather than the nested Map I need. Any suggestions are much appreciated.
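collect() always returns an Array (here an Array[Row]), never a Map, so mapping over its result gives you another Array. You have to turn that array into a map yourself.
Given the dataframe shown in the question, all you need are the header names other than x. Pair each of them with the corresponding value in a row to build the inner Map, then key that inner Map by the row's x value.
This only modifies your map function so that it produces a Map as the result, and lookups such as myLookupMap(90)(101) then work as you describe.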
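The original code for this step did not survive, so the following is only a minimal sketch of the idea, not necessarily the exact code the answer used. It assumes every header name other than x, and every cell value, parses as an Int. buildLookup and its keyCol parameter are hypothetical names introduced for illustration. The columns and rows values stand in for what myLookupDf.columns and myLookupDf.collect() would give you, so the snippet runs without Spark.

```scala
// Hypothetical helper: builds key1 -> (key2 -> value) from header names and row values.
// Assumption: every header other than keyCol, and every cell, parses as an Int.
def buildLookup(
    columns: Seq[String],
    rows: Seq[Seq[Any]],
    keyCol: String = "x"
): Map[Int, Map[Int, Int]] = {
  val keyIdx = columns.indexOf(keyCol)
  // Names and positions of every column except the key column.
  val valueCols = columns.zipWithIndex.filter { case (_, i) => i != keyIdx }
  rows.map { row =>
    // Inner map: header name -> cell value for this row.
    val inner = valueCols.map { case (name, i) =>
      name.trim.toInt -> row(i).toString.trim.toInt
    }.toMap
    // Outer entry: the row's key column value -> inner map.
    row(keyIdx).toString.trim.toInt -> inner
  }.toMap
}

// Stand-ins for myLookupDf.columns and myLookupDf.collect().map(_.toSeq).
val columns = Seq("x", "90", "100", "101")
val rows: Seq[Seq[Any]] = Seq(
  Seq(90, 1, 0, 0),
  Seq(100, 0, 1, 1),
  Seq(101, 0, 1, 1)
)

val myLookupMap = buildLookup(columns, rows)
```

With the real dataframe, the call would presumably be buildLookup(myLookupDf.columns.toSeq, myLookupDf.collect().map(_.toSeq).toSeq). Going through toString keeps the sketch indifferent to whether the columns are stored as Int, Long, or String, at the cost of failing on non-integer values.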
Now you can pass myLookupMap to your udf function.
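Since the question asks for a broadcast lookup table, one possible way to wire the map into a UDF is sketched below. The local SparkSession, the largeDf dataframe, its column names x1 and x2, the output column flag, and the names lookupBc and lookupUdf are all assumptions made for illustration. The map literal simply repeats the values from the question's table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lookup-sketch")
  .getOrCreate()
import spark.implicits._

// Nested map holding the values from the question's table.
val myLookupMap: Map[Int, Map[Int, Int]] = Map(
  90  -> Map(90 -> 1, 100 -> 0, 101 -> 0),
  100 -> Map(90 -> 0, 100 -> 1, 101 -> 1),
  101 -> Map(90 -> 0, 100 -> 1, 101 -> 1)
)

// Broadcast once so each executor keeps a single read-only copy.
val lookupBc = spark.sparkContext.broadcast(myLookupMap)

// Returns None (null in the result column) when either key is missing.
val lookupUdf = udf((x1: Int, x2: Int) =>
  lookupBc.value.get(x1).flatMap(_.get(x2))
)

// Hypothetical larger dataframe with the two key columns.
val largeDf = Seq((90, 90), (90, 101), (101, 100)).toDF("x1", "x2")
val result = largeDf.withColumn("flag", lookupUdf(col("x1"), col("x2")))
result.show()
```

Returning an Option from the UDF avoids a NoSuchElementException for key pairs that are absent from the table. If every pair is guaranteed to exist, lookupBc.value(x1)(x2) would do.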