Update a value in a specific row by checking a condition on other column values in PySpark

I want to update a value in a row based on a condition: compare the new row's values with three already existing records and, if they match, update the new row with the corresponding data from the matched records.

I need to compare three columns (LastName, Birthdate and NationalID) in the dataframe. If the values in these three columns match twice with different EmployerCode values, the PySpark code should look up the EmployeeUniqueID column and use the EmployeeUniqueID value from the matched records.

There is 1 best solution below

Lau On

If the 3 columns match twice with different EmployerCode values, does it mean that there are 2 different employer codes for the same last name + birth date + national ID?

Assuming your reference dataframe is like this:

lastName|birthDate|nationalId|EmployeeCode
AAA|BBB|123|Emp1
AAA|BBB|123|Emp2
DDD|EEE|789|Emp6
XXX|YYY|456|Emp5

I believe one way to approach this is to split the reference dataframe into 2 parts:

  1. Combinations of lastName, birthDate and nationalId that have more than 1 employee code
  2. Combinations of the same 3 columns that have only 1 employee code

The code below performs this split:

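from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session, or create one when running outside a notebook/shell
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('AAA', 'BBB', '123', 'Emp1'),
    ('AAA', 'BBB', '123', 'Emp2'),
    ('XXX', 'YYY', '456', 'Emp5'),
    ('DDD', 'EEE', '789', 'Emp6')
], ['lastName', 'birthDate', 'nationalId', 'EmployeeCode'])

# Count the distinct employee codes for each (lastName, birthDate, nationalId) combination
aggregated_df = df.groupBy("lastName", "birthDate", "nationalId").agg(
    F.countDistinct("EmployeeCode").alias("distinctEmployeeCode")
)

# Part 1: combinations that appear with more than one employee code
part_one = aggregated_df.filter(F.col("distinctEmployeeCode") > 1)
# Part 2: combinations that appear with at most one employee code
part_two = aggregated_df.filter(F.col("distinctEmployeeCode") <= 1)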

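Now you can join accordingly. Inner join your dataframe with part_one on the 3 columns and take the EmployeeUniqueID, since any record that matches this join belongs to a combination with more than one EmployeeCode.

A join with part_two identifies the records whose combination of the 3 columns has a single, distinct EmployeeCode. Note that aggregated_df keeps only the count, so join part_two back to df on the same 3 columns if you need the EmployeeCode value itself.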