All values are strings:

Row 1:
  first_name_l = robert        first_name_r = robert
  last_name_l  = null          last_name_r  = allen
  dob_l        = 1971-06-24    dob_r        = 1971-05-24
  city_l       = null          city_r       = null
  average_score   = 49.57
  matched_columns = [first_name_score, dob_score]
  lookup_l_list   = [first_name_l, dob_l]
  lookup_r_list   = [first_name_r, dob_r]

Row 2:
  first_name_l = null          first_name_r = robert
  last_name_l  = allen         last_name_r  = alen
  dob_l        = 1971-06-24    dob_r        = 1971-06-24
  city_l       = london        city_r       = lonon
  average_score   = 69.95
  matched_columns = [dob_score, city_score, last_name_score]
  lookup_l_list   = [dob_l, city_l, last_name_l]
  lookup_r_list   = [dob_r, city_r, last_name_r]

How can I add a column 'lookup_l' whose value for the first row is [robert, 1971-06-24]? The values come from the df itself, but which columns to read, and in what order, is given by that row's lookup_l_list. Likewise, 'lookup_r' for the first row should be [robert, 1971-05-24], with the columns taken from that row's lookup_r_list. The same logic should apply to every row. How can I code this in PySpark? I have tried several approaches but get an error every time.

Expected result: a df with all the previous columns plus two new columns, 'lookup_l' and 'lookup_r', holding the values described above (lists of values drawn from the other columns).
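In plain Python terms, the per-row logic being asked for looks like this (an illustrative sketch only, treating each row as a dict; `build_lookup` is a hypothetical helper, not PySpark API):

```python
def build_lookup(row, list_col):
    # Read the column names stored in this row's list column, then
    # fetch those columns' values from the same row, in that order.
    return [row[name] for name in row[list_col]]

# First sample row from the question.
row1 = {
    "first_name_l": "robert", "first_name_r": "robert",
    "last_name_l": None, "last_name_r": "allen",
    "dob_l": "1971-06-24", "dob_r": "1971-05-24",
    "city_l": None, "city_r": None,
    "lookup_l_list": ["first_name_l", "dob_l"],
    "lookup_r_list": ["first_name_r", "dob_r"],
}

print(build_lookup(row1, "lookup_l_list"))  # ['robert', '1971-06-24']
print(build_lookup(row1, "lookup_r_list"))  # ['robert', '1971-05-24']
```

The difficulty in Spark is that this indirection (column names stored as data) has no direct row-by-row equivalent; the answer below shows one way to express it with column expressions.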

1 Answer

Answer from user238607:

This should help you out.

from pyspark.sql import functions as F

# Collect a fixed list of columns into a single array column.
columns_to_concat = ["col1", "col2", "col3"]
new_df = df.withColumn("new_array_column", F.array(*columns_to_concat))