I want to update a value in a new row by checking it against three already-existing records: if the comparison matches, the new row should be updated with the corresponding data from the matched records.
Specifically, I need to compare three columns (LastName, Birthdate and NationalID) in the dataframe. If these three values match twice with different EmployerCode values, then the PySpark code should look up the EmployeeUniqueID column and use the EmployeeUniqueID value corresponding to the matched records.
If the 3 columns match twice with different EmployerCode values, does that mean there are 2 different EmployerCodes for the same LastName + Birthdate + NationalID?
Assuming your reference dataframe is like this:
I believe one way to approach this is to separate the reference dataframe into 2 parts: the records whose (LastName, Birthdate, NationalID) key appears under more than one EmployerCode, and the records whose key is tied to exactly one EmployerCode. The code below splits it into the 2 parts mentioned.
Now you can join accordingly: inner join your dataframe with part_one to get the EmployeeUniqueID, since records that match this join have keys duplicated across EmployerCodes. The join with part_two will likewise give you the corresponding EmployeeUniqueID; any record that joins with this table has a single, distinct EmployerCode.