All values are strings:
| first_name_l | first_name_r | last_name_l | last_name_r | dob_l | dob_r | city_l | city_r | average_score | matched_columns | lookup_l_list | lookup_r_list |
|---|---|---|---|---|---|---|---|---|---|---|---|
| robert | robert | null | allen | 1971-06-24 | 1971-05-24 | null | null | 49.57 | [first_name_score, dob_score] | [first_name_l, dob_l] | [first_name_r, dob_r] |
| null | robert | allen | alen | 1971-06-24 | 1971-06-24 | london | lonon | 69.95 | [dob_score, city_score, last_name_score] | [dob_l, city_l, last_name_l] | [dob_r, city_r, last_name_r] |
How can I add a column named 'lookup_l' whose value for the first row is [robert, 1971-06-24]? The column names are taken from that row's lookup_l_list, and the corresponding values are read from the dataframe in the same order as the names appear in lookup_l_list. Similarly, 'lookup_r' for the first row should be [robert, 1971-05-24], built from the names in that row's lookup_r_list. The same should be done for every other row. How can I code this in PySpark? I have tried several approaches but get an error every time.
The expected result is a dataframe with all the existing columns plus two new columns, 'lookup_l' and 'lookup_r', holding the lists of values described above.
This should help you out.