PySpark: combine column values from different rows into a single row, ordered by another column

I have a dataframe with two columns, CLMN_SEQ_NUM and CLMN_NM. I am trying to combine the values of CLMN_NM into a single comma-separated row.

Desired output: PR_NAME,PR_ID,PR_ZIP,PR_ADDRESS,PR_COUNTRY

cols_comb = df.agg(F.concat_ws(",",F.collect_list(F.col("CLMN_NM")))).first()[0]

But the result comes out in a different order, PR_ZIP,PR_NAME,PR_COUNTRY,PR_ID,PR_ADDRESS, which does not match the order in the dataframe.

How can I combine the values so that they are ordered by CLMN_SEQ_NUM?

There is 1 best solution below

Shubham Sharma On

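Pack CLMN_SEQ_NUM and CLMN_NM into a struct, with the sequence number first, then aggregate the dataframe to collect all the structs into a list. Sorting that list in Python orders the rows by CLMN_SEQ_NUM, after which the names can be extracted: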

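from pyspark.sql import functions as F
# CLMN_SEQ_NUM comes first in the struct so that sorted() orders the rows by it
L = df.agg(F.collect_list(F.struct('CLMN_SEQ_NUM', 'CLMN_NM'))).first()[0]
cols = [r.CLMN_NM for r in sorted(L)]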

# cols
['PR_NAME', 'PR_ZIP', 'PR_ADDRESS', 'PR_COUNTRY', 'PR_ID']
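
To see why the field order inside the struct matters, here is a minimal sketch in plain Python that needs no Spark session. It uses a namedtuple as a hypothetical stand-in for pyspark.sql.Row (which is also a tuple subclass), and the sequence numbers are assumed for illustration, since the original dataframe was only shown as an image.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, which is also a tuple subclass.
Row = namedtuple("Row", ["CLMN_SEQ_NUM", "CLMN_NM"])

# Assumed sample data: the sequence numbers are invented for illustration
# and listed in the shuffled order that collect_list may return.
collected = [
    Row(3, "PR_ZIP"),
    Row(1, "PR_NAME"),
    Row(5, "PR_COUNTRY"),
    Row(2, "PR_ID"),
    Row(4, "PR_ADDRESS"),
]

# Tuples compare element by element, so sorted() orders by CLMN_SEQ_NUM first.
cols = [r.CLMN_NM for r in sorted(collected)]

# Join the ordered names into the comma-separated string the question asks for.
cols_comb = ",".join(cols)
print(cols_comb)
```

If CLMN_NM were placed first in the struct, the same sort would order the names alphabetically instead. The ordering could presumably also be done inside Spark by wrapping the collect_list call in F.sort_array, though that variant is not part of the original answer.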