I have a dataframe. A column "names" contains column headers whose values should be concatenated. I want to do it with PySpark's concat_ws(), but nothing works. I must solve it with the concat_ws() function, no pandas etc.
The best I got was the concatenated headers, not the values in those columns. I couldn't return a list from the function and unpack it in concat_ws().
from pyspark.sql.functions import udf, split, concat_ws, col

map_cols = {'a': 'newA', 'b': 'newB', 'c': 'newC', 'd': 'newD'}

@udf
def get_result(names_col):
    # maps each key from the split "names" value to its column header
    headers = []
    for i in names_col:
        headers.append(map_cols[i])
    return headers

# this only concatenates the header names returned by the UDF, not the column values
df = df.withColumn('names_arr', split('names', '_')) \
       .withColumn('result', concat_ws(';', get_result(col('names_arr'))))
Input dataframe:

names   | newA | newB | newC | newD
--------|------|------|------|-----
a_b     | 1    | 2    | 7    | 8
a_b_c   | 2    | 3    | 4    | 4
a_b_c_d | 3    | 2    | 4    | 4
c_d     | 89   | 5    | 3    | 5
b_c_d   | 7    | 5    | 6    | 5

Expected output dataframe:

names   | newA | newB | newC | newD | result
--------|------|------|------|------|--------
a_b     | 1    | 2    | 7    | 8    | 1;2
a_b_c   | 2    | 3    | 4    | 4    | 2;3;4
a_b_c_d | 3    | 2    | 4    | 4    | 2;3;4;4
c_d     | 89   | 5    | 3    | 5    | 3;5
b_c_d   | 7    | 5    | 6    | 5    | 5;6;5
I am assuming that newA in your expected output is a typo for the last two rows (89 and 7). You can iterate over dataframe.columns, filter out the names column, and perform concat_ws with ; as the separator.
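A minimal sketch of that idea, assuming the map_cols mapping from the question (inverted here as a hypothetical key_of lookup so each header can be traced back to its key in names). It relies on concat_ws skipping NULLs, so each column contributes a value only when its key appears in names:

from pyspark.sql.functions import concat_ws, when, col, array_contains, split

map_cols = {'a': 'newA', 'b': 'newB', 'c': 'newC', 'd': 'newD'}
key_of = {v: k for k, v in map_cols.items()}   # e.g. 'newA' -> 'a'

# iterate the dataframe columns, leaving out "names" itself
value_cols = [c for c in df.columns if c != 'names']

df = df.withColumn(
    'result',
    concat_ws(';', *[
        # emit the column's value only when its key occurs in "names";
        # otherwise the expression is NULL and concat_ws drops it
        when(array_contains(split(col('names'), '_'), key_of[c]), col(c))
        for c in value_cols
    ])
)

Note that the pieces come out in column order (newA, newB, newC, newD), so for the sample data this gives 1;2, 2;3;4, 3;2;4;4, 3;5 and 5;6;5, which matches the expected output except for the a_b_c_d row.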