Pyspark - TypeError: unhashable type: 'Column'


I have a dataframe that contains a 3-character country code, and I need to look that code up in a dictionary to get the full country name.

    data = [[1,'USA'],[2,'CAN']]
    cols = ['s.no','country']
    df = spark.createDataFrame(data,cols)
    df.show()

+----+-------+
|s.no|country|
+----+-------+
|   1|    USA|
|   2|    CAN|
+----+-------+

dict_country = {'USA':'United States', 'CAN':'Canada'}

Expected output:

+----+-------+-------------+
|s.no|country| country_name|
+----+-------+-------------+
|   1|    USA|United States|
|   2|    CAN|       Canada|
+----+-------+-------------+

I can accomplish this by

    df.withColumn(
        'country_name',
        F.when(F.col('country') == 'USA', F.lit(dict_country['USA']))
         .when(F.col('country') == 'CAN', F.lit(dict_country['CAN']))
    ).show()

but I don't want to write a separate condition for each country, so instead I tried

    df.withColumn('country_name', F.lit(dict_country[F.col('country')])).show()

and it gives TypeError: unhashable type: 'Column'. Is there a better way to accomplish this?

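First, why the error happens: `dict_country[F.col('country')]` runs as plain Python on the driver while the query is being built, so Python tries to use the `Column` object itself as a dictionary key. `Column` overrides `==` to build a Spark expression rather than return a boolean, which makes it unhashable, and dict keys must be hashable. A minimal sketch with a hypothetical `FakeColumn` class (a toy stand-in, not pyspark's actual implementation) shows the same failure:

```python
# Toy stand-in for pyspark's Column. Defining __eq__ without __hash__
# sets __hash__ to None in Python 3, making instances unhashable --
# the same reason a real Column cannot be used as a dict key.
class FakeColumn:
    def __eq__(self, other):
        return "col = value"  # a placeholder "expression", not a bool

dict_country = {'USA': 'United States', 'CAN': 'Canada'}

try:
    dict_country[FakeColumn()]  # dict lookup needs a hashable key
except TypeError as e:
    print(e)  # unhashable type: 'FakeColumn'
```

The fix is to move the lookup into the Spark expression itself, which both approaches below do.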

Approach 1

Create a map column from the dictionary, then use it to substitute values. This approach can be more efficient on larger dataframes, and it is conceptually close to what you were trying to do.

m = F.create_map(*[x for k, v in dict_country.items() for x in (F.lit(k), F.lit(v))])
result = df.withColumn('country', m[F.col('country')])

Approach 2

Create a dataframe from the dictionary, then broadcast-join it with the original dataframe.

c = spark.createDataFrame(dict_country.items(), ['country', 'name'])
result = df.join(c.hint('broadcast'), on='country', how='left').selectExpr('`s.no`', 'name as country')

+----+-------------+
|s.no|      country|
+----+-------------+
|   1|United States|
|   2|       Canada|
+----+-------------+