I want to update a value in a new row by checking it against three already-existing records: if the comparison matches, the new row should be updated with the corresponding data from the matched records.
Specifically, I need to compare three columns (LastName, Birthdate and NationalID) in the dataframe. If these three values match twice with different EmployerCode values, then the PySpark code should look up the EmployeeUniqueID column and use the EmployeeUniqueID value corresponding to the matched records.
If the 3 columns match twice with different EmployerCode values, does that mean there are 2 different EmployerCodes for the same LastName + Birthdate + NationalID?
Assuming your reference dataframe is like this:
I believe one way to approach this is to separate the reference dataframe into 2 parts: the records whose (LastName, Birthdate, NationalID) key appears under more than one EmployerCode, and the records whose key is tied to exactly one EmployerCode. The code below splits it into the 2 parts mentioned.
Now you can join accordingly: inner join your dataframe with part_one to get the EmployeeUniqueID, since records that match this join have keys duplicated across EmployerCodes. The join with part_two will likewise give you the corresponding EmployeeUniqueID; any record that joins with this table has a single, distinct EmployerCode.