How to apply a dictionary (key-to-value) mapping to a column in a Spark dataset?

Newbie here on Spark... how can I use a column in a Spark dataset as a key to look up values, and add those values as a new column to the dataset?

In Python (pandas), we have something like:

 df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))

where D is a dictionary defined earlier.

How can I do this in Spark using Java? Thank you.

Edit: For example, I have the following dataset df:

A
1
3
6
0
8

I want to create a weekday column based on the following dictionary:

D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"

and add the column back to my dataset df:

A    days
1    Monday
3    Wednesday
6    Saturday
0    Sunday
8    NULL

This is just an example; column A could be anything other than integers, of course.

There is 1 answer below.

  1. Use df.withColumn to return a new DataFrame containing all the previous columns of df plus the new column.
  2. Create a UDF (user-defined function) to apply the dictionary mapping.

Here's a reproducible example in PySpark:

>>> from pyspark.sql.types import StringType 
>>> from pyspark.sql.functions import udf 
>>> df = spark.createDataFrame([(1, 5), (5, 2), (1, 3), (5, 4)], ['A', 'B'])
>>> df.show() 
+---+---+
|  A|  B|
+---+---+
|  1|  5|
|  5|  2|
|  1|  3|
|  5|  4|
+---+---+

>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> # d.get returns None for keys missing from d, which becomes NULL in the column
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values', udf(mapping_func, StringType())("A"))
>>> df.show() 
+---+---+------+
|  A|  B|values|
+---+---+------+
|  1|  5|     x|
|  5|  2|     z|
|  1|  3|     x|
|  5|  4|     z|
+---+---+------+
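Since the question asked for Java: the same withColumn + UDF approach translates directly to Spark's Java API. Below is a minimal sketch applied to the weekday example; the class name WeekdayLookup, the UDF name dayName, and the local[*] session setup are illustrative, not part of the original answer.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WeekdayLookup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("weekday-lookup")
                .master("local[*]")
                .getOrCreate();

        // The dataset from the question: one integer column A.
        Dataset<Row> df = spark.createDataFrame(
                Arrays.asList(
                        RowFactory.create(1), RowFactory.create(3),
                        RowFactory.create(6), RowFactory.create(0),
                        RowFactory.create(8)),
                new StructType().add("A", DataTypes.IntegerType));

        // The dictionary D from the question, as a plain (serializable) Java map.
        Map<Integer, String> days = new HashMap<>();
        days.put(1, "Monday");
        days.put(2, "Tuesday");
        days.put(3, "Wednesday");
        days.put(4, "Thursday");
        days.put(5, "Friday");
        days.put(6, "Saturday");
        days.put(7, "Sunday");

        // Register a UDF that looks each key up in the map. Map.get returns
        // null for missing keys (0 and 8 here), which becomes NULL in the column.
        spark.udf().register("dayName",
                (UDF1<Integer, String>) days::get,
                DataTypes.StringType);

        // withColumn returns a new Dataset with the extra column appended.
        df.withColumn("days", callUDF("dayName", col("A"))).show();

        spark.stop();
    }
}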
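If you'd rather avoid a UDF entirely, you can build a map literal column and look the key up with element_at (available since Spark 2.4), which likewise returns NULL for absent keys. A sketch, reusing df from above (requires import org.apache.spark.sql.Column along with the static functions import):

// Build a literal map column from the dictionary, as alternating key/value pairs.
Column dayMap = map(
        lit(1), lit("Monday"), lit(2), lit("Tuesday"), lit(3), lit("Wednesday"),
        lit(4), lit("Thursday"), lit(5), lit("Friday"), lit(6), lit("Saturday"),
        lit(7), lit("Sunday"));

// element_at(dayMap, col("A")) yields the mapped value, or NULL if A is not a key.
df.withColumn("days", element_at(dayMap, col("A"))).show();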